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NEEDLES AND STRAW IN HAYSTACKS: EMPIRICAL BAYES 
ESTIMATES OF POSSIBLY SPARSE SEQUENCES 

By Iain M. Johnstone'^ and Bernard W. Silverman^ 
Stanford University and Oxford University 

An empirical Bayes approach to the estimation of possibly sparse 
sequences observed in Gaussian white noise is set out and investi- 
gated. The prior considered is a mixture of an atom of probabiUty 
at zero and a heavy-tailed density 7, with the mixing weight chosen 
by marginal maximum likelihood, in the hope of adapting between 
sparse and dense sequences. If estimation is then carried out using 
the posterior median, this is a random thresholding procedure. Other 
thresholding rules employing the same threshold can also be used. 
Probability bounds on the threshold chosen by the marginal maxi- 
mum likelihood approach lead to overall risk bounds over classes of 
signal sequences of length n, allowing for sparsity of various kinds and 
degrees. The signal classes considered are "nearly black" sequences 
where only a proportion rj is allowed to be nonzero, and sequences 
with normalized £p norm bounded by rj, for > and < p < 2. Esti- 
mation error is measured by mean gth power loss, for < g < 2. For all 
the classes considered, and for all q in (0,2], the method achieves the 
optimal estimation rate as n — > 00 and 77 ^ at various rates, and in 
this sense adapts automatically to the sparseness or otherwise of the 
underlying signal. In addition the risk is uniformly bounded over all 
signals. If the posterior mean is used as the estimator, the results still 
hold for g > 1. Simulations show excellent performance. For appro- 
priately chosen functions 7, the method is computationally tractable 
and software is available. The extension to a modified thresholding 
method relevant to the estimation of very sparse sequences is also 
considered. 

1. Introduction. 
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1.1. Thresholding to find needles and straw. There are many statistical 
problems where the object of interest is a high-dimensional parameter on 
which we have a single observation, perhaps after averaging, and subject to 
noise. Specifically, suppose that X = {Xi, . . . , X„) are observations satisfying 



where the are A^(0, 1) random variables, not too highly correlated. Let fi 
be the vector of means fi = {fii,fJ-2, ■ ■ ■ , fJ'n)- Clearly, without some knowledge 
of the /ij we are not going to be able to estimate them very effectively, and in 
this paper we consider the advantage that may be taken of possible sparsity 
in the sequence. 

In what contexts do problems of this kind arise? Some examples are the 
following: 

• In astronomical and other image processing contexts, the Xi may be noisy 
observations of the pixels of an image, where it is known that a large 
number of the pixels may be zero. 

• In the model selection context, there may be many different models that 
conceivably contribute to the observed data, but it is of interest to select 
a subset of the possible models. In this case, the individual Xi are the raw 
estimates of the coefficients of the various models, renormalized to have 
variance 1. 

• In data mining, we may observe many different aspects of an individual or 
population, and we are only interested in the possibly small number that 
are "really there" ; this is much the same as the model selection situation, 
but couched in different language. 

• In nonparametric function estimation using wavelets, the true wavelet 
coefficients at each level form a possibly sparse sequence, and the dis- 
crete wavelet transform yields a sequence of raw coefficients, which are 
observations of these coefficients subject to error. Wavelet approaches in 
nonparametric regression take advantage of this structure in a natural 
way. This context originally motivated the work of this paper but the 
potential applicability of the ideas developed is much wider. 

A natural approach to all these problems is thresholding: if the absolute 
value of a particular Xi exceeds some threshold t, then it is taken to corre- 
spond to a nonzero /ij which is then estimated, most simply by Xi itself. If 
\Xi\ < t, then the coefficient \fii\ is estimated to be zero. But how is t to be 
chosen? The importance of choosing t appropriately is illustrated by a sim- 
ple example. Consider a sequence of 10,000 fii, of which m are nonzero and 
(10,000 — m) zero. The nonzero values are allocated at random and are each 
generated from a uniform distribution on (—5,5). By varying the number 
m, sequences of different sparsities can be generated, as shown in Figure 1. 




Xi — fii + ei 
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Fig. 1. Absolute value of parameter images of various sparsity. Out o/ 10,000 pixels, the 
number of nonzero parameters is, from left to right: 5, 20, 100 m the top row and 500, 
2000, 10,000 in the bottom row. Each nonzero parameter is chosen independently from a 
uniform distribution on (—5,5). 



In this figure the 10,000 /Xj are arranged in a 100 x 100 pixel image. The 
absolute value of the image is plotted in gray scale in order to allow white 
to correspond to the value zero. Estimating a sparse signal is like finding 
needles in a haystack; it will be necessary to find which are the very few 
signal values that are nonzero, as well as to estimate them. On the other 
hand, estimating a dense signal is more like finding straw in a haystack; no 
longer will we be surprised if a particular /ij is nonzero. 

Independent Gaussian noise of variance 1 is added to the fii to yield 
a sequence Xi. The resulting images are shown in Figure 2. The average 
square estimation error yielded by thresholding Xi with varying thresholds is 
plotted in Figure 3. Ignore the points marked by arrows for the moment. The 
number in the top right of each panel is the value of m, so m = 5 corresponds 
to a very sparse model, while m = 10,000 corresponds to a very dense model, 
with no zero parameter values at all. The naive estimator, estimating each 
/ij by the corresponding Xi without performing any thresholding at all, will 
produce an expected mean square error of 1. The scales in each panel are 
the same, and the threshold range is from to \/21og 10,000 = 4.292, the 
so-called universal threshold for a sample of this size. 
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Three things can be seen from this figure. First, the potential gain from 
thresholding is very large if the true parameter space is sparse. For the 
sparsest signals considered in Figures 1 and 3, the minimum average square 
error achieved by a thresholding estimate is 0.01 or even less; see Figure 4 
for a graph of minimum average square error against sparsity. Second, the 
appropriate threshold increases as the signal becomes more sparse. For the 
fully dense signal, no thresholding at all is appropriate, while for the sparsest 
signals, the best results are obtained using the universal threshold. Finally, 
it is important for the threshold to be tuned to the sparsity of the signal; if 
a threshold appropriate for dense signals is used on a sparse signal, or vice 
versa, the results are disastrous. 

Thus, thresholding is a very promising approach, but the crucial aspect 
is the choice of threshold. A good threshold choice method will have several 
properties, as follows: 

• It will be adaptive between sparse and dense signals, between finding 
"needles" and finding "straw." 

• It will be stable to small changes in the data. 

• It will be tractable to compute, with software available. 

• It will perform well on simulated data and on real data. 
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Fig. 3. Mean square error of thresholding data obtained from the images m Figure 1 by 
adding Gaussian white noise. In each panel the arrow indicates the threshold chosen by 
the empirical Bayes approach. The prior used for the nonzero part of the distribution was 
a Laplace distribution with scale parameter a = i . Each plot is labeled by the number of 
nonzero pixels, out of 10,000, in the underlying signal. 




number of nonzero parameters number of nonzero parameters 

Fig. 4. Left panel: threshold plotted against sparsity. The solid line is the threshold cho- 
sen by the empirical Bayes method, while the dashed line is the threshold that yields the 
minimum possible average square error. Right panel: log base 10 of the average square er- 
ror yielded by the empirical Bayes threshold (solid line) and by the best possible threshold 
(dashed line). The models illustrated in Figure 1, and intermediate models, were used to 
construct these graphs. 
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• It will have good theoretical properties. 

In this paper we set out and investigate a fully automatic empirical Bayes 
thresholding method, which satisfies all these desiderata. In the example 
the method chooses the threshold values shown by the arrows in Figure 
3. It can be seen that the empirical Bayes method is very good at track- 
ing the minimum of the average square error. More details are given in 
Figure 4. The empirical Bayes thresholds are always close to the optimal 
thresholds, and — right across the range of sparsity considered — the aver- 
age square error obtained by the empirical Bayes threshold is very close 
indeed to the best attainable average square error. A documented imple- 
mentation EbayesThresh of our methodology in R and S-PLUS is available. 
See Johnstone and Silverman (2003) for details. 

1.2. Specifying the empirical Bayes method. In the present paper we con- 
centrate attention on the case where the errors are independent. In some 
contexts this assumption is restrictive. While beyond the scope of the present 
paper, it is of obvious interest to extend our method and the supporting the- 
ory to dependent data, and this is a natural topic for future work. 

The notion that many or most of the /ij are near zero is captured by 
assuming that the elements fii have independent prior distributions each 
given by the mixture 

(2) /prior (^) = (1 -^y)5o(/u) +'U;7(/i). 

The nonzero part of the prior, 7, is assumed to be a fixed unimodal symmet- 
ric density. In most previous work in the wavelet context mentioned above, 
the density 7 is a normal density, but we shall see that there are advantages 
in using a heavier-tailed prior, for example, a double exponential distribution 
or a distribution with tails that decay at polynomial rate. 

For any particular value of the weight w, consider the posterior distribu- 
tion of fi given X = x under the assumption that X ~ A^(/-f, 1). Let jl{x;w) 
be the median of this distribution. For fixed w <1, the function fi(x;w) 
will be a monotonic function of x with the thresholding property, in that 
there exists t{w) > such that jl{x;w) = if and only if \x\ < t{w). Fig- 
ure 5 shows the prior distribution and the posterior median function jl(x;w) 
for the Laplace mixture prior with a = 0.5 and two different values of the 
weight w. 

Let g denote the convolution of the density 7 with the standard normal 
density (p. The marginal density of the observations Xi will then be 

(1 — w)(p{x) + wg{x). 

We define the marginal maximum likelihood estimator w of w to be the 
maximizer of the marginal log likelihood 

n 

iiw) = log{(l - + wgiXi)} 

i=l 
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Fig. 5. First line: Prior distribution for w — 0.4 and w = 0.02, for the mixed Laplace 
prior with a — 0.5. The atom of probability at zero is represented by the solid vertical bar, 
plotted to the scale indicated on the right of the plot; the probability density of the nonzero 
part of the prior is plotted to the scale at the left. Second line; Posterior median functions 
for the same priors. The dotted line is the diagonal y — x. It can be seen that the posterior 
median is a monotonic function of the data value and is zero whenever the absolute value 
of the datum is below the threshold. 



subject to the constraint on w that the threshold satisfies t{w) < \/21ogn. 
The threshold chosen by the method will then be the value t{w). 

The function l!.'{w) is a monotonic function of ^y, so its root is very easily 
found numerically, provided the function g is tractable; see Section 2.2. Our 
basic approach will then be to plug the value w back into the prior and 
then estimate the parameters /Xj using this value of w, either using the 
posterior median itself, or by using some other thresholding rule with the 
same threshold t{w). In the example above simple hard thresholding was 
used. 

Another possibility is to use the posterior mean, which we denote w), 
so that the corresponding estimate is jli = fl{Xi;w). The posterior mean 
rule fails to have the thresholding property, and, hence, produces estimates 
in which, essentially, all the coefficients are nonzero. Nevertheless, it has 
shrinkage properties that allow it to give good results in certain cases. We 
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shall see that both in theory and in simulation studies, the performance of 
the posterior mean is good, but not quite as good as the posterior median. 

The empirical Bayes is a fully automatic practical method; intuitively, the 
reason it works well is as follows. If the means //j are all near zero, then w 
will be small, corresponding to a large threshold t{w), so that most of the 
means will be estimated to be zero. On the other hand, if the /ij are larger, 
then a small threshold will be chosen, and the data will not be shrunk so 
severely in the estimation of the vector of means. 

1.3. Measures of sparsity and minimax rates. The sparsity of a signal is 
not just a matter of the proportion of /ij that are zero or very near zero, but 
also of more subtle ways in which the energy of the signal /z is distributed 
among the various components. Our theory will demonstrate that the empir- 
ical Bayes choice of estimated threshold yields a highly adaptive procedure, 
with excellent properties for a wide range of conditions on the underlying 
signal. 

A natural notion of sparsity is the possibility that ;U is a nearly black 
signal, in the sense that the number of indices i for which Hi is nonzero is 
bounded. We define 

(3) 4W = |A^:n-if^7[/i,/0]<r?|. 

With just the knowledge that /i falls in lo[r]], how well can /x be estimated? 
Define the minimax average square error by 

n 

^n,2(^oW) =inf sup n~'^^E{fj,i- flif. 
1^ fieeoiv] i=i 

Donoho, Johnstone, Hoch and Stern (1992) show that, considering r] = rjn ^ 
as n — > oo, Rnfi is 2r](logr]~^){l + o(l)). 

A more subtle characterization of sparsity will not require any /ij to be 
exactly zero, but still constrain most of the energy to be concentrated on a 
few of the Hi, by placing bounds on the p-norm of for p > 0. There are 
various intuitive ways of understanding why = {J2 lA'il^)"^^^ for small p 
is related to the sparsity of //. Perhaps the simplest is to consider the energy 
(the sum of squares) of a vector with ||/u||p = 1 for some small p. If only one 
component of fi is nonzero, then the energy will be 1. If, on the other hand, 
all the components are equal, then the energy is n^~^/^ which tends to zero 
as n — > oo if p < 2, rapidly if p is near zero. By extension of these examples, 
if p is small, the only way for a signal in an ip ball with small p to have large 
energy is for it to consist of a few large components, as opposed to many 
small components of roughly equal magnitude. Put another way, among all 
signals with a given energy, the sparse ones are those with small ip norm. 
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In this case we suppose the signal belongs to an £p norm ball of small 
radius rj, 

(4) ^,M = {/.:n-i^|/.,r<7f}, 

and define the minimax square error 

n 

-Rn,2(^2M) =inf sup n~^^E{fii- iiif. 

M ^,&^oH i=i 

Again, considering ^ as n ^ oo, Donoho and Johnstone (1994) show 
that, for p< 2, Rn,i{lp[ri\) is 7?P(21ogr/-P)i-P/2(l + o(l)). 

The estimator that attains the ideal performance over a nearly black class, 
or over an Hp ball for some p > 0, will in general depend on p and on ry. The 
minimax rate is a benchmark for the estimation of signals that display the 
sparseness characteristic of membership of an (.p class. Our main theorem will 
show that, under mild conditions, an empirical Bayes thresholding estimate 
will essentially achieve the minimax rate over rj simultaneously for all p in 
[0,2], including the nearly black class as the case p = 0. In this sense it 
adapts automatically to the degree and character of sparsity of the signal in 
the optimum possible way. 

A particular minimax risk is the risk when there is no constraint at all 
on the underlying signal. In this case the minimax asymptotic risk is a con- 
stant 1, for example, achieved by the estimator that simply estimates /ij by 
Xi. We show that the maximum possible risk of the empirical Bayes thresh- 
olding method, under appropriate conditions, is also uniformly bounded, so 
the adaptivity is not bought at the price of asymptotically unbounded risk 
for signals of certain kinds. 

1.4. Robustness. While adaptivity of an estimator is obviously desirable, 
it is also important that the estimator should be robust to assumptions 
made. There are several aspects of such robustness that we demonstrate for 
the empirical Bayes threshold estimator. 

Assumptions on the signal: Although our procedure is derived from the 
sparse prior model (2), we derive results under the much weaker assumption 
that the underlying signal belongs to an appropriate ip ball. 

Assumptions on the noise: For example, in Section 5 we relax the as- 
sumption of Gaussian errors in order to investigate the relation between 
tails of the prior and tails of the noise density. While, in their present form, 
some other aspects of our subsequent discussion make use of Gaussian as- 
sumptions, the key properties of the posterior median thresholding rule hold 
under considerably weaker assumptions. 
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Assumptions on the error measure: Rate-optimal risk bounds are estab- 
lished for mean qth power error measures for all q 6 (0,2], not just for the 
standard mean square error. Excessive reliance on mean square error (g = 2) 
is often criticized, for example, as not corresponding to visual assessments 
of error. Choices of q <2 will give greater (relative) weight to small errors, 
and in some sense, the g — > limit corresponds to counting the number of 
errors I{fii ^ m}. 

Assumptions on the estimator itself: While the posterior median is the 
motivating estimator for our work, the exact form of the thresholding rule 
is not specified in our theoretical discussion. The key point is that the data 
dependent threshold is chosen according to the sparse empirical Bayes pre- 
scription. Indeed, the processing rule does not even have to be a strict 
thresholding rule. We obtain good results for the posterior mean, which 
is not a thresholding rule but still possesses an appropriate bounded shrink- 
age property; however, for full robustness to the choice of error measure, 
strict thresholding rules have to be used. 

1.5. Related work. Abramovich, Benjamini, Donoho and Johnstone (2000) 
show that the false discovery rate approach provides adaptive asymptotic 
minimaxity at the level of exact constants, as well as the rates of convergence 
that we demonstrate for the empirical Bayes method. However, their results 
do not guarantee robustness for denser signals, and there is some evidence 
of this nonrobustness in the simulations we report in Section 3. 

In a more restrictive scenario than ours, and mainly concentrating on 
the application to wavelet smoothing, Zhang (2004) provides an asymptot- 
ically more sharply adaptive empirical Bayes analysis. This analysis uses 
much more general families of priors than our simple mixtures, and employs 
nonparametric infinite-order kernel methods to estimate the corresponding 
marginal densities. Such methods are complex to implement in software, 
and their sharp asymptotic properties might not be apparent in moderate 
samples. 

Mixture priors built from models such as (2) are quite common in Bayesian 
variable selection problems: our interest was stimulated in part by analysis 
of a proposal due to George and Foster (1998, 2000) which takes 7 to be 
Gaussian. For further references specifically in the wavelet setting, see the 
companion paper Johnstone and Silverman (2004). 

1.6. Outline of the paper. The paper now proceeds as follows. In Section 

2 we set out some key definitions and state the main theorem of the paper. To 
show that the advantages of the estimate are not just theoretical, in Section 

3 a simulation study is presented, comparing the empirical Bayes method 
with a range of other estimators, on cases covering both sparse and dense 



EMPIRICAL BAYES FOR SPARSE SEQUENCES 



11 



signals. In this study the theoretical adaptivity and robustness properties of 
the empirical Bayes method are clearly borne out. In very sparse cases the 
theory suggests that some asymptotic improvement may be possible for very 
sparse signals, and in Section 4, we set out a modification of our standard 
procedure, whereby the threshold is increased by a suitable factor when the 
signal is estimated to be very sparse. We state a result giving key properties 
of this procedure, and also present some discussion and numerical results 
that suggest that, except when the sample size is very large indeed, the 
modification may be of theoretical interest only. 

We then move to the proofs of the main results. In Section 5 various de- 
tailed preliminaries are considered, including the properties of the posterior 
rules under more general noise distributions than the Gaussian. We then 
go on, in Section 6, to consider risk bounds first for fixed thresholds, and 
then for data-dependent thresholds. These bounds depend on tail probabil- 
ities for the random thresholds. As a prerequisite to the control of these 
probabilities. Section 7 investigates properties of the moment behavior of 
the marginal likelihood score function. In Section 8 the proof of the main 
theorem is completed: the results of Section 7 yield tail probabilities of the 
prior parameters chosen by the empirical Bayes method, and, hence, of the 
corresponding random thresholds. These are fed into the bounds of Section 
6 to complete the proof. Section 9 then contains the modifications to the 
previous arguments needed to prove Theorem 2. 

The conditions in the main theorem for the posterior mean do not cover as 
wide a range of loss functions as for strict thresholding rules. In Section 10 it 
is shown that this is an essential feature of the use of such a rule; for values 
of g < 1 the posterior mean cannot yield an optimal estimate relative to qth 
power loss under the same broad conditions. 

2. Aspects of the sequence estimation problem. It is convenient to set 
up some notational conventions. Where and are numerical quanti- 
ties depending on a discrete or continuous index r, we write Ar x Br to 
denote < lim'mir Ar/ Br < limsupr Ar/Br < oo, and Ar ~ Br to denote 
Ar/ Br 1. We use 4> and $ for the standard normal density and cumula- 
tive, respectively, and set $ = 1 — <I>. When there is no confusion about the 
value of the prior weight w, it may be suppressed in our notation. Use c and 
C to denote generic strictly positive constants, not necessarily the same at 
each use, even within a single equation. We adopt the convention that c is 
an absolute constant, while the use of C will indicate a possible dependence 
on the prior density component 7. 

2.1. Assumptions on the prior. When using the mixture prior (2), we 
shall see that there are considerable advantages in using a heavy-tailed den- 
sity for 7, for example, the Laplace density 



(5) 
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or the mixture density given by 

(6) {n\e = e)^ N{0,6~^ - I) with e~Beta(a,l). 

The latter density for fi has tails that decay as ^u"^"""*^, so that, in particular, 
if a = ^, then the tails will have the same weight as those of the Cauchy 
distribution. To be explicit, this has 

In both cases (5) and (6) the posterior distribution of n given an observed 
X , and the marginal distribution of X, are tractable, so that the choice of 
w by marginal maximum likelihood, and the estimation of /i by posterior 
mean or median, can be performed in practice, using the approach outlined 
in Section 2.2. Details of the relevant calculations for particular priors are 
given by Johnstone and Silverman (2004). 

Throughout the paper we will assume that the nonzero part of the prior, 
7, has a fixed unimodal symmetric density. In addition, we will assume that 

^ log7(u) 



(7) sup 

M>0 



du 



A < oo. 



It follows from this assumption that, for u> 0, logj{u) > log 7(0) — An, so 
that, for all u, 

(8) 7(n) > 7(0)e-^l"L 

Thus, the tails of 7 have to be exponential or heavier, and the Gaussian 
model for 7 is ruled out. We will also assume that the tails of 7 are no 
heavier than Cauchy, in the sense that v?'^{u) is bounded over all u. Finally, 
we make the mild regularity assumption that, for some k G [1,2], 

;>oo 

(9) liv)'^ / ^{u) du >i y'^~^ asy— >oo. 

If 7 has asymptotically exponential tails, then k = 1. If 7(7/) x for large 
y, then the tail probability is asymptotic to y~^ and k = 2. Any Pareto tail 
behavior gives the value k = 2. 

2.2. Finding the estimate. Define the score function S{w) =i'{w), and 
define 

(10) /3(x) = 4^-l and (3{x,w) - ^^^^ 



(t){x) ' ' l + wp{x) 



so that 
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Since by elementary calculus (3{x, w) is a decreasing function of w for each 
X, the function S{w) is also decreasing. Let Wn be the weight that satisfies 
t{wn) = V^logn. If S{wn) > and <S'(1) < 0, then the zero of S in the range 
[wn,!] is the estimated weight w. Furthermore, the sign of S{w) for any 
particular w specifies on which side of w the estimate w lies. [Note that S 
will be strictly decreasing except in the pathological case where P{Xi) = 
for all i, when S{w) = for all w and the likelihood is constant.] 

The marginal maximum likelihood approach can be used to estimate other 
parameters of the prior. In particular, if a scale parameter a is incorporated 
by considering a prior density (1 — w)5o{^) + wa'y(afj,), define ga to be the 
convolution of 07(0 •) with the normal density. Then both a and w can be 
estimated by finding the maximum over both parameters of 

n 

iiw,a) = 5]log{(l - w)cl){Xi) + wga{X,)}. 

i=l 

If 7 is the Laplace density, the tractability of the procedure is not affected 
by the inclusion of a scale parameter into the prior. In this case if one is 
maximizing over both w and o, then a package numerical maximization 
routine that uses gradients has been found to be an acceptably efficient way 
of maximizing i(w,a). 

In the current paper we will not develop theory for the case where addi- 
tional parameters of 7 are estimated, but we will include the possibility of 
estimating a scale parameter in the simulation study reported in Section 3. 

The R/S-PLUS software package EbayesThresh [Johnstone and Silverman 
(2003)] includes a routine that performs empirical Bayes thresholding on a 
vector of data. It allows the use of either the Laplace or the quasi-Cauchy 
prior, and in the case of the Laplace prior, the scale parameter can if desired 
be chosen by marginal maximum likelihood. Estimation may be carried out 
using the posterior median or posterior mean rule, or by hard or soft thresh- 
olding. In addition, there are several routines that will allow users to develop 
other aspects of the general approach. 

2.3. Shrinkage rules. We begin with some definitions, leading up to the 
statement of the main theorem of the paper. A function 6{x, t) will be called 
a shrinkage rule if and only if 5{-,t) is antisymmetric and increasing on 
(—00, 00) for each t>0, and 

(12) < 6{x, t)<x for ah x > 0. 

The shrinkage rule 6{x,t) will be a thresholding rule with threshold t if and 
only if 

(13) 6{x,t) = if and only if < t, 
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and will have the bounded shrinkage property relative to the threshold t if, 
for some constant b, 

(14) \x - 6{x,t)\ <t + b for all x and t. 

For any given weight w, the posterior median will be a thresholding rule 
and will have the bounded shrinkage property if |(log7)'| is bounded; see 
Lemma 2(v). In Section 5.5 it is demonstrated that the posterior mean for 
the same weight will have the same bounded shrinkage property, but will not 
be a strict thresholding rule. If the hyperparameter w is chosen by marginal 
maximum likelihood, both are examples of rules with random threshold 
i=t{w). 

2.4. Risk measures and the main result. As already mentioned, we do 
not restrict attention to losses based on squared errors, but we measure risk 
by the average expected qth power loss 

n 

(15) i2g(/i,/x)=n-i^£;|A»-Aiir, 0<q<2. 

1=1 

Note that the posterior median and mean estimators for prior (2) are Bayes 
rules for the q = l and q = 2 error measures, respectively. 

We set two goals for estimation using the empirical Bayes threshold: "uni- 
form boundedness of risk" and "flexible adaptation." To explain what we 
mean by flexible adaptation, suppose that the signal is sparse in the sense 
of belonging to an ip norm ball ip[rj] as defined in (4). As before, we include 
nearly black classes as the case p = 0. If the radius ij is small, we would hope 
that the estimation error Rq{fl,fj,) should be appropriately small. How small 
is benchmarked in terms of the minimax risk 

-Rn,g(^p[??]) =inf sup Rg{fL,fl). 

Suppose r7 = ?7„— >Oasn^oo but that, in the case q> p> 0, 

(16) n-%-Hlogr/-P)i/2^0, 

which prevents rj from becoming very small too quickly. (For p = we require 
ni] ^ oo.) Then we have the asymptotic relation 

(17) Rn,g{^p[nn]) rp^g{7]n) aS n ^ OO, 

where 

{r/'?, 0<q<p, 

r?P(21og7?-P)('?~P)/2, 0<p<q, 

r?(2 log 77-^)^/2^ p = 0,q>0. 
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The relation (17) is proved by Donoho and Johnstone (1994) for the case 
p > and q > 1, but only minor modifications are needed to extend the 
result to all the cases we consider. 

We can now state our main result, which gives comparable bounds on the 
risk function of the empirical Bayes thresholding procedure. Apart from an 
error of order {\ogn)'^^^''~'P^^'^ , the procedure uniformly attains the same 
error rate as the minimax estimator for all p in [0,2] and q in (0,2]. 

Theorem 1. Suppose that X Nn{fi,I), that 6{x,t) is a thresholding 
rule with the bounded shrinkage property and that <p <2 and < q <2. 
Let w be the weight chosen by marginal maximum likelihood for a mix- 
ture prior (2) with 7 satisfying the assumptions set out in Section 2.1. Let 
t = t{w), where t{w) denotes the threshold of the posterior median rule cor- 
responding to the prior weight w. Then the estimator fii{x) = 5{xi,t{w)) 
satisfies: 

(a) (^Uniformly bounded riskj There exists a constant Co((?,7) such that 

sup RqifL, fx) < Co- 

(b) ( AdaptivityJ There exist constants Ci{p, q, 7) such that for rj < r]Q{p, q, 7) 
and n > nQ{p, q, 7), 

(19) sup Rg{fL,fi) < Cirp,g{ri) + C2n-\\ognf+^''"P^/\ 

When q £ (1,2], these results also hold for the posterior mean estimate fi. 

We emphasize that it is not necessary that 5{x,t) be derived from the 
posterior median or mean rule. It might be hard or soft thresholding or some 
other nonlinearity with the stated properties. The point of the theorem is 
that empirical Bayes estimation of the threshold parameter suffices with all 
such methods to achieve both adaptivity and uniformly bounded risk. 

If g > p > 0, then we necessarily have p < 2, and the first term of (19) 
dominates if 77^ > n~^log^n and the second if rjP < n~^log^n. It follows 
that the result is equivalent to 

^ ^ .'m/' lC7n-niogn)2+{'^-P)/2, if r?^ < n^^ log^ n. 

Note that rjP > n~^log^n is a sufficient condition for (16). For the nearly 
black case p = 0, a similar argument leads to (20) with i]^ replaced by rj. 
li p>q, the bound can be written as 

(21) sup i2g(/i,//) <Cmax{r?'',n^i(logn)2+('?-P)/2} 
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and the "break-even" point between the two bounds occurs at a value of 
r] bounded above by 77^ = log^ n. It remains the case that for rj^ > 
n~^log^n the supremum of the risk is bounded by a multiple of rp^q{r}). 
Therefore, for every p and q in (0,2], and for the nearly black case p = 0, 
our estimator attains the optimal g-norm risk (18), up to a constant mul- 
tiplier, for all sufficiently large n and for r/ satisfying log^ n < rj^ <rjQ 
if p > and log^ n <r] <r]Q if p = 0. 

3. Some simulation results. In order to investigate the capability of the 
empirical Bayes method to adapt to the degree of sparsity in the true sig- 
nal, a simulation study was carried out. We approach the issue of sparsity 
directly, by explicitly constructing sequences with a wide range of sparse 
behavior. The S-PLUS code used to carry out the simulations is available 
from the authors' web sites, enabling the reader both to verify the results 
and to conduct further experiments if desired. 

As an initial range of models for sparse behavior, we fixed the sample size 
n to 1000. We considered the estimation of a sequence /i which has /ij = 
except in K randomly chosen positions, where it takes a specified value /.io- 
For each i, a data value Xi ~ N{iJ,i, 1) is generated, and various methods are 
used to estimate the sequence from the sequence of Xi. 

The parameter K controls the sparsity of the signal, and the values for 
which results are reported are 5, 50 and 500 — ranging from a very sparse 
signal, indeed, to one in which half the data contain nonzero signal. The 
other parameter fiQ gives the strength of the signal if it is nonzero. The 
values reported were 3, 4, 5 and 7, bearing in mind that the noise is A^(0, 1). 
One hundred replications were carried out for each of the values of K and 
fiQ, with the same 100,000 noise variables used for each set of replications. 

The posterior median estimator was used, with the prior parameters cho- 
sen by marginal maximum likelihood for two different functions 7 for the 
nonzero part of the prior. The double exponential ^{u) = ^aexp(— a|u|) was 
used with both the parameter a and the prior weight w chosen by marginal 
maximum likelihood. For comparison, the heavy-tailed mixture density with 
Cauchy tails, as defined in (6) with a = ^, was also considered. For both 
choices of the function 7, the performance of the posterior median as a 
point estimator was studied. For double exponential 7 with both parameters 
estimated, two other estimators were also considered, the posterior mean, 
and hard thresholding with threshold equal to that of the posterior median 
function. In addition, the effect of fixing the scale parameter in the double 
exponential was investigated by considering four different values of a; in 
each case w was chosen by marginal maximum likelihood and the posterior 
median estimator used. 

These methods were compared with classical soft and hard universal 
thresholding (using the threshold \/2Togn w 3.716) and with three other 
methods intended to be adaptive to different levels of sparsity. 
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The SURE method [Donoho and Johnstone (1995)] aims to minimize the 
mean squared error of reconstruction, by minimizing Stein's unbiased risk 
estimate for the mean squared error of soft thresholding. Thus, we choose 
tsURE as the minimizer (within the range [0,\/2Togn]) of 

n n 

U{t) = n + ^ 4 A t2 - 2 ^ I{xl < 
1 1 

This is based on the unbiased risk estimator of Stein (1981) in the estima- 
tion of a multivariate normal mean. In addition, a modification proposed by 
Donoho and Johnstone (1995) aimed at gaining greater adaptivity is con- 
sidered; this chooses between the SURE and universal thresholds according 
to the result of a test for sparsity; see also Section 6.4.2 of Bruce and Gao 
(1996) for details. 

The false discovery rate (FDR) approach is derived from the principle 
of controlling the false discovery rate in simultaneous hypothesis testing 
[Benjamini and Hochberg (1995)] and has been studied in detail in the esti- 
mation setting, for example, by Abramovich, Benjamini, Donoho and Johnstone 

(2000) . Order the data by decreasing magnitudes: 

> |2;|(2) > ••• > \x\{n) 

and compare to a quantile boundary: 

tk=crz{q/2 ■ k/n), 

where the false discovery rate parameter g G (0, 1 /2] . Define a crossing index 

kp = max{A; : \x\^f.-^ > t^} 

and use this to set the threshold tp = tj,^. Various values for the rate pa- 
rameter q were used. 

Block thresholding methods are designed to make use of neighboring in- 
formation in setting the threshold applied to each individual data point. We 
considered the BlockThresh method of Cai (2002) and the hard thresholding 
versions of the NeighBlock and NeighCoeff methods of Cai and Silverman 

(2001) . The principle of all these methods is to consider the data in blocks. 
BlockThresh thresholds all the data in each block by reference to the sum 
of squares of the data in the block. The other two methods use overlapping 
blocks and keep or zero the data in the middle of each block according to 
the sum of squares over the whole block. See the original papers for more 
details. 

For each method considered, for each replication the total squared error 
of the estimation X](Ai ~ ^j)^ was recorded, and the average over 100 repli- 
cations is reported. The square error of every replication is available from 
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Table 1 

Average of total squared error of estimation of various methods on a mixed signal of 
length 1000. A given number of the original signal values is set equal to a nonzero value, 
and the remainder are zero. In each column those entries that outperform the 
MML /exponential/posterior median method are underlined. Those that outperform by 
more than about 10% are set in bold type. The row marked "postmean" refers to the 
posterior mean using the double exponential model. The row "exphard" refers to hard 
thresholding using the threshold given by the posterior median of the marginal maximum 

likelihood choice within the double exponential model. The rows for fixed values of a 
correspond to the posterior median where only the weight w is chosen by MML and the 
scale parameter a is fixed at the given value 



Number nonzero 




5 






50 






500 




Value nonzero 


3 


4 


5 


7 


3 


4 


5 


7 


3 


4 


5 


7 


Exponential 


36 


32 


17 


8 


214 


156 


101 


73 


857 


873 


783 


658 


Cauchy 


37 


36 


1 S 


Q 
O 


071 


176 


103 


77 


922 


898 


829 


7/1 Q 


Postmean 


34 


32 


21 


11 


201 


169 


122 


85 


860 


888 


826 


708 


Exphard 


51 


43 


22 


11 


273 


189 


130 


91 


998 


998 


983 


817 


a= 1 


36 


32 


19 


15 


213 


166 


142 


135 


994 


1099 


1126 


1130 


a = 0.5 


37 


34 


17 


10 


244 


158 


105 


92 


845 


878 


884 


884 


a = 0.2 


38 


37 


18 


7 


299 


188 


95 


69 


1061 


730 


665 


656 


a = 0.1 


38 


37 


18 


6 


339 


227 


102 


60 


1496 


798 


609 


579 


SURE 


38 


42 


42 


43 


202 


209 


210 


210 


829 


835 


835 


835 


Adapt 


42 


63 


73 


76 


417 


620 


210 


210 


829 


835 


835 


835 


FDR g = 0.01 


43 


51 


26 


5 


392 


299 


125 


55 


2568 


1332 


656 


524 


FDR 5 = 0.1 


40 


35 


19 


13 


280 


175 


113 


102 


1149 


744 


651 


644 


FDR g = 0.4 


58 


58 


53 


52 


298 


265 


256 


254 


919 


866 


860 


860 


BlockThresh 


46 


72 


72 


31 


444 


635 


600 


293 


1918 


1276 


1065 


983 


NeighBlock 


47 


64 


51 


26 


427 


543 


439 


227 


1870 


1384 


1148 


972 


NeighCoeff 


55 


51 


38 


32 


375 


343 


219 


156 


1890 


1410 


1032 


870 


Universal soft 


42 


63 


73 


76 


417 


620 


720 


746 


4156 


6168 


7157 


7413 


Universal hard 


39 


37 


18 


7 


370 


340 


163 


52 


3672 


3355 


1578 


505 


the authors' 


web sites 


for 


any 


reader 


who 


wishes to 


examine the result 


s in 



more detail. 

Some results are given in Table 1 and the following conclusions can be 
drawn: 

• The Cauchy method is always nearly, but not quite, as good as the ex- 
ponential method. Our theory is not sensitive enough to discriminate be- 
tween the two methods. 
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In general, the posterior mean does not perform quite as well as the pos- 
terior median. 

It is better to use the posterior median function itself rather than hard 
thresholding with the resulting threshold. 

In the case fJ.o = 7 where the nonzero signal is very clearly different from 
zero, hard thresholding with the universal threshold performs somewhat 
better than the exponential method, but in other cases, particularly with 
moderate or large amounts of moderate sized signal, it can give disastrous 
results. 

Estimating the scale parameter a is probably preferable to using a fixed 
value, though it does lead to slower computations. In general, the auto- 
matic choice is quite good at tracking the best fixed choice, especially for 
a sparse and weak signal. 

SURE is a competitor when the signal size is small (/^o = 3), but performs 
poorly when /xq is larger, particularly in the sparser cases. The attempt 
to make SURE more adaptive is counterproductive. 

If q is chosen appropriately, FDR can outperform exponential in some 
cases, but the choice of q is crucial and varies from case to case. With the 
wrong choice of q, the performance of FDR can be poor. 
The block thresholding methods do not perform very well. In the compan- 
ion paper [Johnstone and Silverman (2004)] block thresholding methods 
are also compared with empirical Bayes methods for the thresholding of 
wavelet coefficients, and the difference in performance is not so great. This 
is presumably because there is some correlation among the positions in 
which the wavelet coefficients are effectively nonzero. By contrast, in the 
test signals under current consideration, the nonzero positions are chosen 
by uniform random sampling without replacement. 

The median standard error of the entries of the table with 5 nonzero 
coefficients is around 1, with corresponding figures of about 3 for those 
with 50 nonzero coefficients, and 5 for the entries with 500 nonzero coeffi- 
cients. Generally speaking, the standard errors tend to be smaller for the 
empirical Bayes methods than for the other methods considered; the false 
discovery rate and block thresholding methods have errors that have vari- 
ance two to three times as large as the double exponential MML posterior 
median method, and for the universal thresholding methods the variance 
is higher by a factor of about 5. This is an indication of the stability of 
the empirical Bayes methods. 

Not surprisingly, given that the same data are used for all cases, the 
standard error of the comparison between the first method and the other 
methods in the table is typically smaller than that for individual entries 
taken alone. The comparison standard error has a median value of 0.8 for 
the sparsest signals and about 2 for the signals with 50 and 500 nonzero 
elements. In general, comparisons between empirical Bayes methods have 
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somewhat smaller standard errors than those involving other approaches. 
Only about 10% of the comparisons between the top line and other entries 
in the table are within 3 standard errors of zero, and all the comparisons 
that are numerically more than trivial are clearly statistically significant 
on the basis of the study we have carried out. 

The two SURE methods, the FDR method with q = 0.01 oi q = 0.4, and 
the two universal thresholding methods all have the property that there is 
a case in which their measured error is around three or more times that of 
the exponential method, while never, or hardly ever offering any substantial 
improvement. Hence, all are much worse at adapting to different patterns 
of sparsity. The FDR method with g = 0.1 is a better competitor, but only 
wins in four of the twelve cases. The best improvement over exponential is 
651/783, a 17% improvement, while the best improvement of the exponential 
over the FDR method is 73/102, nearly 30%. Taking both adaptivity and 
overall performance into account, the exponential is clearly the estimator of 
choice. 

In order to quantify the comparison between the various methods, for 
each of the models considered define the inefficiency of a method A for a 
particular model B to be 

"average error for method A applied to data from model B 

100 X — 1 . 

minimum error for any method for model B 

Twelve different models are considered in Table 1, and summary statistics 
for the twelve inefficiency values for the various methods are given in Ta- 
ble 2. The posterior median of the exponential model with estimated scale 
parameter is the best on nearly every measure: the maximum inefficiency of 
the Cauchy and exponential (a = 0.2) methods is slightly smaller, but both 
of these methods are decisively beaten on the median inefficiency and are 
also equaled or beaten on the other two measures. 

4. Modifying the threshold for very sparse signal. In this section we 
discuss a possible modification of the estimator, which allows a reduction 
in error in very sparse cases, when the overwhelming majority of compo- 
nents have essentially zero signal. Our original motivation for this arises 
from the use of wavelet methods to estimate derivatives, where it was shown 
by Abramovich and Silverman (1998) that the appropriate universal thresh- 
old is not \/21ogn, but is a multiple of this quantity. The basic notion of 
the modified estimator is this: if the threshold t = t{w) estimated by the 
marginal maximum likelihood method is at or near the universal threshold, 
we replace it by a higher threshold. 
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Table 2 

Comparison of methods: for each method the 
stated median, mean, maximum and tenth 
largest inefficiency is over the twelve cases 
considered in Table 1 





mGdiciii 


niGciii 


lULIi 


nicix 


Exponential 


7 


17 


30 


52 


Cauchy 


19 


25 


42 


47 


Postmean 


22 


27 


40 


95 


Exphard 


37 


46 


62 


93 


a= 1 


35 


57 


124 


165 


a = 0.5 


15 


29 


75 


84 


a = 0.2 


18 


19 


30 


48 


a = 0.1 


14 


24 


45 


80 


SURE 


35 


121 


151 


676 


Adapt 


103 


223 


303 


1282 


FDR q = 0.01 


44 


56 


91 


210 


FDR <j = 0.1 


18 


35 


39 


139 


FDR q = 0.4 


71 


169 


214 


847 


BlockThresh 


129 


228 


456 


531 


NeighBlock 


119 


181 


335 


376 


NeighCoeff 


106 


136 


131 


486 


Universal soft 


529 


643 


1282 


1367 


Universal hard 


50 


100 


159 


359 



4.1. Definition of the modified estimator and theoretical discussion. To 
be precise, set t"^ = 21ogn — 5 log log n. Let ^ > be fixed and put = 
y/2{l + A) logn. Then define 

iA = l^' ift<t„, 
\tA, ift>t„. 

With this modification of the threshold, we can reduce the order of mag- 
nitude of the part of the error in Theorem 1, as follows. 

Theorem 2. Under the assumptions of Theorem 1, let jxA he defined by 
f^A,i{x) = 6{xi,iA)- Define n^q{ri) = sup^^^^^^] Rq{/lA,i, fi)- Then, for suitable 
constants C , 



(22) 



n^,q{ri)<C for all 
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and, for all sufficiently large n and for suitable rjo, 

(23) 7^^,(r/)<C7max{rp,g(r/),n-l-^(logn)('^-l)/2} for7]<rjo. 

For q> p> we also have, for sufficiently large n, 

T^iqiv) < Cmax{n(«-P)/V,n-^-^(logn)(^~i)/2| 

(24) 

for vjP < n^^{logn)P/'^ . 

The ramifications of this theorem in the wavelet context are explored by 
Johnstone and Silverman (2004), but it has independent interest in exposing 
the different regimes for adaptive estimation, especially in the case q > p. 
Note first that conclusion (22) is the same as for the unmodified estimator, 
and in the range r/^ > n~^log^n for p > {r] > n~^\og'^n for p = 0) so is 
(23), because in that range the dominating term in the error is rp^q{r]) for 
both estimators. 

For q> p> 0, define t]i = n~^(logn)^/^. For ij > rji, rp^q{r]) is bounded by 
a multiple of n^i~PyPr]'' and so (24), in fact, holds for all rj <i]q, but only 
gives a stronger result than (23) if rj <r]i. This is not in contradiction with 
the result (17) of Donoho and Johnstone (1994) because the condition (16) 
can be rewritten precisely as 

rp,g(7?)=o(n(''-^')V), 

or equivalently, Ty/r/i — > oo. 

The three bounds in Theorem 2 may be considered as corresponding to 
three different zones of estimation. If 77 > ryo, then the signal is insufficiently 
sparse for any order of magnitude advantage to be gained by the use of 
our thresholding method. In the zone rji <rj <r]o, a suitable thresholding 
method allows for considerable improvement over the use of a "classical" 
estimator. Finally, in the extremely sparse zone rj < r]i, the r/-dependent 
part of the error achieved by our estimator compares to that given by the 
estimator that simply returns the value zero. 

Note finally that for the standard estimator all the ry-dependent risks in 
the zone rj'^ < log^ n (for p > 0) are dominated by the term n~^(logn)^^'''^~'P^/^ 
in the error bound. Since log^n > rj^, the zone where n~^(logn)'^^('?~P)/'^ 
dominates includes the very sparse zone, and so there is no point in pursuing 
the kind of adaptivity within the very sparse zone achieved asymptotically 
by the modified estimator. 

4.2. Practicalities. Both theory and intuition suggest that the modi- 
fied estimator may only be advantageous for very large values of n, where 
5 log log n is small relative to 2 log n. Otherwise, data that ought to be thresh- 
olded with moderate thresholds will essentially be zeroed instead. For ex- 
ample, for n = 10^, we have 5 log log n = 13.13 and 21ogn = 27.63. Hence, if 
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the squared estimated threshold in the standard estimator is any more than 
about half the universal threshold, the modification will use a much larger 
threshold, thereby causing problems for signals that are nowhere near the 
very sparse zone. 

A version of the modified estimator was investigated by simulation on the 
same models as considered in Table 1. The Laplace prior with both param- 
eters estimated by marginal maximum likelihood was used. If the estimated 
threshold was less than 95% of the universal threshold, the posterior me- 
dian estimate was used. Otherwise, we used hard thresholding with threshold 
2^1og 1000, corresponding to A = l. 

The only models for which the estimates were affected were those with 
only 5 nonzero entries. In each case the average squared error was increased 
by the use of the modified estimator, respectively to 41, 40, 26 and 13, as 
compared to 36, 32, 17 and 8, for the cases where the nonzero parameter 
value was 3, 4, 5 and 7. Reducing the number of nonzero parameters to 
1 did not change the relative performance of the unmodified and modified 
methods, unless the nonzero parameter value was also increased. The only 
case tested where the modified method improved the performance was where 
there was a single nonzero parameter value with value 10. In this case the 
unmodified estimator has an average squared error (over 100 simulations) 
of about 2.4, while the modified estimator has a mean squared error of 
just over 1. As might be expected, the modified estimator is only clearly 
advantageous in very sparse cases where nonzero values of the parameters 
are well above the universal threshold — and in these cases the error of the 
unmodified method is already very small, so any improvement may be large 
in relative terms but small in absolute terms. 

5. Proofs of results: some detailed preliminaries. The remainder of the 
paper is devoted to the proofs of the theorems stated above. We begin in 
this section with a detailed discussion of a number of topics that will be 
useful later in the proof. In some cases these also cast a broader light on the 
empirical Bayes thresholding procedure. Our proofs cover the cases of nearly 
black and strong ip constraints on the underlying parameter vector fi. We 
conjecture that similar arguments can be used for weak ip constraints too, 
but the full details are left for future investigation. 

5.1. Properties and definitions for the mixture model. The arguments of 
this section and the next do not strongly depend on the precise assumption 
of Gaussian errors in model (1). Indeed, relaxing this assumption sheds some 
light on the robustness of our results to model formulation; see Remark 1. 

For the moment then, we assume that in model (1) the noise coefficients 
€i are i.i.d. from a symmetric Polya frequency PF-^ density ip. Polya fre- 
quency functions are discussed in detail by Karlin (1968), and from a statis- 
tical perspective by Brown, Johnstone and MacGibbon (1981). The defining 
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property of a PF3 function tp is that for yi <y2 < Us and zi < Z2 < Z3, 

(25) detMy^-Zj)]>0. 

l<t,j<3 

Examples of such densities include the Gaussian density <j) (observe the 
distinction of notation), as well as the somewhat heavier tailed Laplace 
density ^e"'^' and logistic density e~^/(l + e~^)^. The PF3 assumption 
implies that (p is log-concave, and hence there exists p > such that 

(26) ip{y)e^^ is decreasing for sufficiently large y. 

Thus, the tails of <p cannot be heavier than exponential. 

For this section only, we also modify assumption (7) on the prior to require 
only that 7(tt) > for all u and the existence of positive A and M such that 



(27) sup 

u>M 



<A<p. 



[In the Gaussian error case, ip = (j), this places no essential constraint on A, 
because we can choose p to be arbitrarily large.] Assumptions (26) and (27) 
taken together imply that the tails of the prior 7 are heavier than those of 
the noise density. 

The first part of the following lemma shows that the convolution 7 * 99 
inherits properties assumed of 7. 

Lemma 1. Assume (26) and (27), and let g = -y-kip. Then 

(28) g{x)^jix), 

(29) (1 + v?)g{u) is bounded for all u, 

POO 

(30) g{y)-' g{u)du^y--\ 

(31) limsup|(logg)'(u)| < A 

and g/ip is strictly increasing from ((7/99) (0) < 1 to +00 as x — > 00. 

Proof. It follows from (27) that e^^7(y) is an increasing function of y 
for y > M, and since 7 is unimodal, that for all x and y in [0,M], 

e^^7(x) < Ce^y-fiy) 

for some C > 1. Combining these two observations implies that, given any 
X > and u > 0, 

(32) 7(x + m) > C7-ie-%(x) and 7(2; - u) < Ce%(x). 
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g{x) > / (f{u)j{x + u)du>C ^7(x) 



-Au 



ip{u) du 



ip{u){-f{x + u)+-f{x-u)}du<C-i{x) I {1 + e^'^)ip{u)du, 



and 

9{x) 



and the right-hand integrals are finite from (26) because A < p. Properties 
(29) and (30) follow immediately from (28) and the assumptions on 7. 

For (31), setting Aqo = sup|(log7)'|, we have |7'(u)| < A'y{u) for u> M 
and |7'(ti)| < Aoo7('u) for u < M. Therefore, 



|(log5)'(x)| = \g'{x)\/g{x) 



ip{x — u)j'{u) du 



< A 



M 



ip{x — u)j{u) du/g{x) + Ac 



M 



(p{x — u)j{u) du/g{x) 



< A + Aooyo(x), 
where from (32) 



p{x) 



Ix-M 

< C[^{x)/g{x)] 



7(3; — v)ip{v) dv/g{x) 

e^V(w) dv^O 



as X — > 00. 



x-M 



To demonstrate that g{x)/ip{x) is increasing on [0, 00), let ru{x) = [ip{x + 
u) + ip{x — u)]/ip(x). Using the symmetry of 7, we have the representation 



(33) 



(fix) 



ruix)j{u)du, 



and so it will suffice to show that, for each n > 0, ru{x) is an increasing 
function of x on [0,oo). Suppose that X2 > xi > and consider the defining 
inequality (25), with {yi} = {—xi,xi,X2} and {zi} = {—u,0,u}. Subtracting 
the second row in the determinant from the first and exploiting symmetry 
of (f) gives 



< [v{-xi +u) - (p{xi + u)] 



1 -1 

(p{xi+u) f{xi) (p{xi-u) 

(p{x2+u) f{x2) (p{x2-u) 

ip{xi)ip{x2)[ip{-xi + u) - (p{xi + u)][ruix2) -r„(xi)]. 



Since 93 > and ip{xi + u) < f{—xi + u), this implies that r„(x2) > ru{xi), 
as required. 
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Finally, to show that g{x)/(t){x) — > oo as x — > oo, we have, for any x > 1, 
using the result (8), 



g{x)> j{x - v)(j){v) dv > j{x) (l){v)dv>Ce-^\''l 
Jo Jo 



□ 



Posterior odds. Write Odds(^|S) for P{A\B)/[l - P{A\B)]. Given w, 
define w{x,w) to be the posterior weight P(/i ^ 0\X = x), so that the pos- 
terior odds Odds(// / 0|x) are given by 

il{x) = il{x,w) = — = [x). 

l — w[x,w) l — wip 

Define wq = {ip/g){0)/[l + {ip/g){0)], so that ^^(0,-^0) = 1. For fixed w, 
Lemma 1 shows that ^}{x) increases from 1 to 00, so that ii w < wq, then 
there exists t{w) > for which il.{T{w),w) = 1. If we define t^w) = for 
w > wq, it follows that w t{w) is a continuous decreasing function of 
w G [0,1]. We will repeatedly use the function r in our subsequent argu- 
ment. 

A simple consequence of these definitions is that for wi Kwq, 

(34) n(T(wi),w) = -^^—^>l, ifu;>u;i. 

1 — W Wl 

Finally, for x > r, we clearly have 

(35) n{x) = 0(t) exp (log g)' - (log ^Y}. 
5.2. Properties of the posterior median. 

Lemma 2. Assume (26) and (27). The posterior median fi{x;w) is 

(i) monotone in x: if xi < X2, then fi{xi) < jl{x2), 

(ii) antisymmetric: /u(— x) = —jl{x), 

(iii) a shrinkage rule: < /i(x) < x for x > 0, 

(iv) a threshold rule: there exists t[w) > such that jji{x) =0 if and only 
if \x\ < t{w), 

(v) bounded shrinkage: there exists a constant b such that for all w,x, 

|/i(x; w) — x\ < t{w) + b. 

Remark 1. The lemma demonstrates that the posterior median has all 
the properties needed for the estimation error bounds that will be derived for 
Gaussian errors in the subsequent sections. The bounded shrinkage property 
essentially means that rare large observations are more or less reliably as- 
signed to a sparse signal rather than noise in our Bayesian model; conditions 
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(26) and (27) indicate that a sufficient condition for this is that the tails of 
the prior be heavier than the tails of the noise distribution. At least in this 
situation, one may expect the qualitative features of our theory to remain 
true; it is left to future work to investigate whether there are differences in 
quantitative thresholds and, perhaps, in rates of convergence. 

Proof of Lemma 2. Suppose that fj, has general prior density /, with 
respect to a suitable dominating measure. Then the posterior density 

f{fi\x) = C{x)ip{x - ij)f{ij), 

so that, for any u<v and X2 > xi, 

f{v\x2)f{u\xi) ^ ip{x2 - v) (p{xi - u) ^ ^ 

f{u\x2)f{v\xi) ip{x2 - U) ip{xi - V) ~ 

so that 

f{v\xi)f{u\x2) < f{u\xi)f{v\x2). 

Now, for any m, integrate with respect to the dominating measure over 
— oo <u<m and m < u < cxo to obtain 

P{fi > m|xi)P(^ < m\x2) < P{fi > m\x2)P{fJ' < m\xi), 

so that the odds that fi<m are greater for X = xi than for X = X2- Letting 
jl{x) be the posterior median of fj,, given X = x,it follows that /i(x2) > A(^i)) 
so the posterior median is a monotonic function of x. 

Return now to the mixture prior (2). The antisymmetry of the posterior 
median is immediate from the symmetry of the prior and the error distribu- 
tion. If w >0, the probabilities P{fi < 0\X = x) and P{n = 0\X = x) will be 
nonzero for all x and each will vary continuously as a function of x. By sym- 
metry, P{fi < 0\X = 0) = P{fi > 0\X = 0) < ^ and so there will be a range of 
values of x containing for which the posterior median is 0. By symmetry 
and the monotonicity of ft there will be some threshold t{w) such that the 
posterior median is zero if and only if — t < x < t. The posterior median of /i, 
given X = T, is necessarily zero, so t{w) < t{w). 

Suppose X > 0. By the assumption that 7 is symmetric and unimodal, 
7(x — v) > 7(x -|- v) for all v >0. Hence, multiplying by (p{v)/g{x), if x > 0, 

(36) f{x-v\X = x,n^O)>f{x + v\X = x,fi^O) for all t; > 0. 

Integrating over < i; < 00, 

p{^i<x\x = x,^lJ^o)>p{^l>x\x = x,^lJ^o). 

Therefore, 

P{H > x\X = x) < P{i^i > x\X = x,fij^O) 
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and so the posterior median satisfies fi{x) < x for all x > 0. By the mono- 
tonicity of fi we have the shrinkage property < fi{x) < x for all x > 0; by 
symmetry it is also the case that > fi{x) > x for x <0. 

Finally we show that the maximum amount of shrinkage is appropriately 
bounded: the approach is to find a constant a such that for all sufficiently 
large x, 

Piii > X — a\X = x) 

(37) 

= P{fi>x-a\X = x,fiy^ 0)P{i^i / 0|X = x) > i. 

The term P{fi > x — a\X = x,fi^O) does not depend on w, and we con- 
sider it first. Set B = sup|„|<jv,/ 7(n)e^"/7(M)e^*^. For n < and for u>M, 
u — > 7(^)6"^" is increasing and so for any c > M we have 

Odds(// > c|X = X, /i / 0) 
(38) 'y{u)ip{x — u) du ^ e~^'^(p{u — x) du 



/f 00 7('ii)v5(x — u) B J^^e ^^ip{u — x)du 

Since A < we have e~^'"ip{v) dv < 00, and so there is a value a such 
that 

/oo /■ — a 

e-^"ip{v) dv>3B e-^"ip{v) dv. 
-a J —00 

As long as X > a + M, from (38) and (39) we will then have 

r~ e-^Xr;) dv 

(40) Odds(/i > X - a|X = X, / 0) > 't!, / w > 2' 

Bj_^e-^^^p{v)dv 

so that 

(41) P{fi>x-a\X = x,n^O)>l. 

Now set e = {p — A)/2. Taking into account (26), (27) and (31), choose 
Ti > M large enough so that for \u\ > ti we have 

(42) (logff)'(n) > -A - e, (log(^)'(u) < -p. 

Choose uji so that r(wi) = ri, and define ci = 2{p — A)~^log2. Suppose 
w < wi, so that t{w) > Ti. It follows from (35) and (42) that, if x > t{w) + ci, 
then 

(43) Odds(^ / 0\X = x) = n{x, w) > n{T{w),w)e^P-^^^''-^^^^ > 2. 

On the other hand, w > wi we will have 0,{x, w) > ^}{x, wi) >2 as long as 
X > Ti + ci. In either case, it follows that P{p 0\X = x) > |. 
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Combining this bound with (41), it fohows that (37) is guaranteed when- 
ever X > max{a + M,t{w) + ci,ri + ci}; otherwise, ah we can say is that 
X — ji{x) < X. Hence, for ah x > and w G [0, 1], 

X — jl{x) < max{o, a + M, t{w) + ci, ri + ci} < t{w) + b 

with 6 = Ti + a V ci , which yields the required shrinkage bound since t{w) < 
t{w). □ 

5.3. Properties of posterior median for Gaussian errors. For the remain- 
der of the paper we speciahze to the Gaussian error density (j) in model (1) 
and to the global boundedness assumption (7) on the logarithmic derivative 
of the prior 7. Property (31) is then strengthened to 



(44) sup 



< A < 00. 



When f = (j), the representation (33) yields 
(45) 1 + p{y) = {g/(p){y) = 2 / cosh(yi)e-* 'h{t) dt. 



Since coshyt is an even convex positive function for each i, it follows that 
l + f3{y) is also. Also, from (45) < l + /3(0) < 1, so that -1 < /3(0) < 0. We 
denote by 13""^ the positive inverse of /3, defined on the interval [/3(0),oo). 
We also have the following simple bounds: 

(46) \l3{y)\<C{g/4>){y), for all y, 

(47) \{g/<i^){y) < m < {gl<t^){y). if y > 

For a lower bound on the second derivative of /?, from (45) we have, for 

(48) (3"{y)= t^cosh{yt)e-' /^^{t)dt> t'^e-' /S{t)dt = (3"{0)>0. 



We now develop an explicit form for the equation defining the threshold 
t{w) . Define g+ (x) = 4>{x - i^)'y{n) dfi and c/_ (x) = (l){x - fi)-f{fi) dfi. 
Then 

p(M>o|x = rifi ,r 

[1 — w)(p[x) + wg[x) 

Therefore, the threshold t satisfies 

(49) 2wg+{t) = {l-w)(^it) + wgit)- 
Dividing by w(j){t) and rearranging yields 

(50) 1 = 1 + g+(t)-g-(0 = 1 + 2 sinh(t^)e-^VS(M) d/i. 
w (pit) Jo 

This equation shows that the posterior median threshold t{w) is continuous 
and strictly decreasing from cxo at = to zero at w = 1. 
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5.4. The link between threshold and pseudothreshold. It will be useful to 
find bounds on the threshold of the posterior median function in terms of 
the weight w. It will be convenient to define the pseudothreshold C('"^) by 

The following result sets out relations between the pseudothreshold C(''^) 
and the true threshold t{w) of the posterior median function. In most of our 
discussion the dependence of t and C on it; is not expressed explicitly, and, 
indeed, any two of t, ^ and w can be regarded as functions of the third. 

Lemmas. For all w £ {0,1], 

(51) 1 + P{t{w)} < f^iCiw)} < 2 + mw)}. 

Proof. The bounds are a straightforward consequence of (50) defin- 
ing t{w), which may be rewritten in the form 

/3(C)=/?(t) + 2-2<7_(i)Mi)- 

Clearly, 

0<g^{t)= r (^it-fi)j{fi)dfi<<P{t) r 7(/i)(i^ = i</>(t). 

J—oo J—oo 

Thus, < 2g^{t) / (t){t) < 1, which establishes (51). □ 

From the properties of (3 we can derive two important corollaries. First, 
we have < t < ^ for all finite t and so that 

(52) t"^ < C^. 

Second, from the property (48) that P"{y) > C for all y, it follows that, for 
y > 0, 13' {y) > Cy. Therefore, 

A^(v;)=V"V(v/5)>ic, 

so that 

/3(c) - Pit) = f3iVe) - f3{V^) > ^cie - 1"). 

Therefore, 

(53) - < 2C-H/3(C) - /5(t)} < 4C-\ 

so (for a different value of C) —t^ < — + C and so finally, for some constant 
C>0, 

(54) cP{t)<Cm- 
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5.5. Properties of the posterior mean. In this section we consider the 
effects of using the posterior mean as an estimate instead of the posterior 
median. We begin by considering the behavior of the posterior distribution 
conditional on n^O, which is also the unconditional case w = 1. Given any 
X, define 



u(j){x — u)j{u) du 

jli{x) =E^ost{^^\X = x,fii^{)) 



(■OO 

-oo 
roo 



/-^ '?^(^ - u)-i{u) du ■ 

A simple argument using the property i^'(i) = —t(f){t) shows that lii{x) = 
x + (\oggy{x), and, hence, using the bound (44), 

(55) \fli{x) - x\ < A. 

Defining fi{x,w) to be the posterior mean E(fi\X = x), we then have 

(56) fl{x,w)=P{fi^O\X = x)E{n\X = x,fi^O)=w{x,w)fli{x). 

From (36), if a; > 0, the posterior mean fli{x) < x; by a similar argument, 
for V > 0, 

f^{v\X = / 0) > Ui-v\X = ^ 0) 

and so /ii(x) > 0. Also, by a simple extension of the corresponding argument 
at the beginning of Section 5.2, /ii is an increasing function of x. Hence, jli 
is a shrinkage rule, and from (56), so is jl{-,w). 

For each x the the posterior weight w{x,w) is monotone increasing in w; 
for X > it follows from (56) that so also is the posterior mean fl{x,w). 

Bounded shrinkage properties of the posterior mean. From (55) and 
(56) we have 

X — jl{x, w) = {1 — w)x — wilogg)' {x). 

Choose wi so that t{wi) = A, and let T2 = t{w /\ wi) > h. Using (35) and 
(44), we have 

i}{x)>expl^J (u — A) (iu| > exp I y (u — T2) (in| = exp {^(m — r2)^}. 
From (34), Q{t2) = Q{t{w A wi),w) > 1 and so for x > T2, 

1-W< l/n{x) < exp{-i(x - T2f}. 

Combining this with bound (44), we obtain for x > T2, 

x - fl{x,w) < (x -r2)(l -w)+T2 + I (log 5)' (x) I 

<{x- T2) exp{-i(2: - T2f} +T2+A< + 2A + t{w). 

If < X < r2, then trivially x — fi<x<T2<A + t{w), so that we have shown 
that the posterior mean is a bounded shrinkage rule relative to the threshold 
t{w). 
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5.6. Bounds for integrals of exponential growth. 
Lemma 4. // (log/i)'(z) > (logA:)'(2) for ze[Ci,C], then 
MOV^ 1^ h{z)dz<{k{Q)Y^ ('^k{z)dz 



< 



U-^l-e-^^^-'^i)], ifk{z)=e^^, 

where in the second case we require also that Ci > and 7C > max(Q,0). 

Proof. The first inequality is seen easily by writing h{z)/h{() = exp{— /^^(log/i)'}, 
applying the assumed inequality and integrating. The second inquality for 
k{z) = e^^ is trivial. For k{z) = e^^ we first note that change of scale 
shows that it suffices to prove the bound for 7 = 1/2. Replacing C,i by and 
completing the square, we then find that the desired bound is implied by 

(57) /^"%^V2rf^<4 (c-„)V2, 



C 

If C > 2max(a,0), then a <(, — a and the integral on the left is bounded by 
2/o^-"e^''/2(i^;. Equation (57) now follows from the inequalities 



e 



-wV2 / g.V2 = / ~ ^-{u,-v){n,+v)/2 



„ roo 

< / e-^'"-''^'"/^ dv < / e-'''"/'^dx = 2/w. 



□ 



Corollary 1. Ifg = j-k(p and j satisfies (7), then 

(58) [\g/(t^y{x)4^{x -^)dx< HgiC; A, /i)(<7/<^)''(C)</'(C " /^), 
Jo 

where 

r8/[(g-l)C], ifq>l,C>2qA/iq-l), 

(59) F,(C;A,^) = U, ifq = l,fi>A, 

[{e^'^-l)/A, ifq = l,0<fi<A. 

Proof. Let h{x) = {g/(l)y{x)((>{x - fi). Then 

f-i-A, q = l, 



(log h) (x) = (?(log g) {x) + qx — {x — fj,) > 



{q - l)x + fi - qA, q>l. 



If g = 1, we apply the preceding lemma with logA;(^;) = (11 — A)z and Ci = 
and obtain factor Hi{(;A,fi) according as > A or not. For g > 1, we use 
the version with log A; quadratic, 7 = (g — l)/2 and a = qA — fj,, so that 
^( > max(a, 0) becomes C> {2/{q — I) max(gA — /i, 0). □ 
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6. Risk properties of thresholding procedures. In this section we study 
the risk behavior of thresholding procedures. Because the thresholds ob- 
tained by the empirical Bayes procedure are data-dependent, some care is 
appropriate in deriving the risk. We begin with risk bounds for hard thresh- 
olding using fixed, nonrandom thresholds. These lead to comparison inequal- 
ities and so to bounds for the risk for general random thresholds. The latter 
continue to hold if the threshold is replaced by a pseudothreshold that is 
easier to find for the mixture prior model. Analogs for the posterior mean 
are studied as well. 

6.1. Risk bounds for fixed thresholds. As a tool for later work, we develop 
risk bounds for hard thresholding. 



in Lq error for <q<2. For the posterior mean estimator fi^Xjiv) of (56), 
a bound of similar structure holds for q> 1, based on the pseudothreshold 
C = (3"^{'w~^) in place of t. 

Proposition 1. (a) Fix q e (0,2]. There exists a constant < 4 such 



The main use of these bounds is to control risks when is not too large, 
and especially when — > 0. The second term in each bound is, up to con- 
stants, a sharp representation of the risk at ^Li = as a function of t or ^. 

Remark 2 . If g = 1 , it can be shown that the risk for the posterior mean 
at zero, 



is already of larger order than in (61), and so our methods for the analysis of 
the behavior of the posterior mean cannot immediately be extended beyond 
the range 1 < q <2. More remarks will be made in Section 10. 

Proof of Proposition 1. We begin with a simple bound vahd for 
any shrinkage rule fi{x). Indeed, for any fj, and x, 



fiHT{x,t) =xl{\x\ >t} 




E\fi{Z,w)\>cw>cm/9iO 



(62) 



\fl{x) - fil" < max{|/i|'', \x - fil"} < l/xl" + \x- 
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Hence, if X ~ N{fi, 1) and aq = E\Z\i, 

(63) E\fi{X)-fi\'i<\fi\'i + ag. 
From this we immediately have, when > 1, 

(64) E\fi{X) - fil" < (1 + a,)|/ir < 

and so, for the rest of the proof, we confine attention to < 1, and, indeed, 
in view of symmetry of the risk functions, to < ^ < 1. 

(a) Write rq{fi,t) for the risk £'|X/{|X| > t} — fi\'^ of hard thresholding. 
We have 

rg{fi,t)=fi'^Mt-fi)-^-t-^i)] + ( +/ )\z\mz)dz. 
By partial integration we obtain the upper bound 

("OO 

(65) rg{0,t)=2 zmz)dz<bgt'i-'^(p{t), 



where bg may be taken as 2 for q <1 and as 4 when 1 < g < 2 and t > \/2. 
By subtraction, 

(66) rgifi, t) - rg{0, t) = ^l'i[^{t - ^i) - <^{-t - fi)] + A(^, t), 

where 



A(/x)=A(/.,t)=( r -/* 

\Jt-n Jt 



zU(z)dz. 



The function (j)q{t) = t'^(j){t) is positive on (0,oo) for all q, and for q>0 
attains its maximum value (p* = <f>{0){q/e)''/'^ at t = ^/q. We remark that 
(pl ^ 1/2 when < g < 3. Set (pg(t,fi) = (j)g{t — fj,) + (pg{t + fj.): some calculation 
then shows that when 0< fi<t, 

Since A(0) = A'(0) = 0, we therefore have A(//) < (pg^ifx^, at least for < ^ < t. 
Combining this with (66), we have 

rg{fi,t) <rg{0,t) + fj,'' + (P*g^^fi^ . 

If ^ < 1, then /x^ < fi'^, and bringing in (65), we obtain (60) with Cg < 4. 
(b) In view of (62) and (65), we have, for < /i < ^ and C ^ 2, 

[fi" + \x - n\'>]c^{x - n) dx 

(67) < 2[/.'^(c - fi)-' + 2(c - f^r~'m - /i) 

<5c«-V(c-/")- 
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On the interval < x < C we have here I/cq = g{0)/(j){0) < 1 + wj3{x) < 2, 
and so 

_ wg{x)/(l){x) / N/i,/ N 

Together with (55) this shows that fl{x,'w) < cq'w{C + A){g/(p){x), and so 
E[\fi-fi\^,\X\<C] 



(68) 



< 2'?-^*'^ + [2cou;(C + A)]"^ / {g/cpyix^ix - n) dx 

Jo 

(69) < 2^5 + [8/{q - l)C][3cowC{g/4>mr4>iC " /i), 

using (58) and (59), vahd for ( > 2qA/{q — 1), and noting that for such C, 
we have also C + A < 3(^/2. Since (3{C) = > 1, we always have {g/(p){C) = 
/3(C) + 1 < 2/3(0 = 2w~^. Inserting these remarks into (69) and combining 
with (67) yields, for G [0, i] and C > Co = max{2, /3~i(l), 2gA/(g - 1)}, that 

E\~^-fi\'^<2fi'^ + CgC'~^ct>{C-fi). 

For < /.i < 1/C, one has ((){(,— fi) < e4>{(), while for 1/C < /i < ^, some cal- 
culus shows that C'^"^'/'(C~/^) < /^^ < A*'- This completes the proof of (61) for 
< /i < ^, while for /i > ^ the bound follows by a simple modification of (64). 

□ 



6.2. Risk bounds for general random thresholds. We begin with a sim- 
ple bound. Suppose that 5 is a shrinkage rule with the bounded shrinkage 
property, and that t is a random threshold with t <t with probability one 
on the event A. Then 

E\6{X,i) - fil'^lA < 2E[\5{X,i) - Xl" + \X- /il"]/^ 

(70) <2{\t + blTiA) + [E\X - /il^-?] V2p(^)i/2} 

<4{t« + 69 + l}P(yl)i/2. 

[We have used ^jZp? < {EZ^)^i/^ < 3 for g < 2.] 

We now consider more specific risk bounds for random thresholds. The 
first will be particularly useful for small values of the true mean fj,, in con- 
junction with a constant t which is with high probability a lower bound for 
the threshold t. 

Lemma 5. (a) Suppose that < q <2, that X N{fi, 1) and that t is a 
random threshold that may depend both on X and on other data. Suppose 
that 5 is a thresholding rule with the bounded shrinkage property, and let 



fi = 5{X,i). 
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Suppose that t > \/2- Then for all fi, 

(71) E\il - < c,[|/ir + t'^Uit) + (f^ + + l){Pit < t)}'/']. 

(b) If 1 < q < 2, a similar result holds for the posterior mean with esti- 
mated pseudothreshold C- For C > CoiQil) CLiT-d for all fi, 

(72) E\f,ix,w) - < c'gM'^ + C-'m + + b'^ + l){PiC < C)}'/']. 

Proof. The method for both parts is essentially identical, so we con- 
centrate on the thresholding case (a). Denote by n*{X) the effect of applying 
to X the hard thresholding rule with threshold t. If t is a data dependent 
threshold with t >t, then it follows from the shrinkage and thresholding 
properties of fi and fj.* that both 

(73) sign(/i) =sign(^*) and < < |/U*|. 
Hence, 

\fi - nl" < maxll^l', \fi* - fi\i} < + l/i* - /il?. 
If we remove the overall constraint that t>t, it remains the case that 

|/i-/i|«/[t>t]<|/i|« + |/i*-/x|«. 

The inequality (60) for the risk of the hard thresholding rule /i* with fixed 
threshold t shows that for t > \/2, 

(74) E\fi - fi\H[i >t]< cjl/i^ + f^-'m]- 

Now consider the case t <t. By the bounded shrinkage property and (70) it 
follows that 

(75) E\fi - <t]< Ait" + b'' + l){P{i < t)}^/\ 

Putting together the two bounds (74) and (75) completes the proof of (71). 

For the posterior mean, we use the pseudothreshold ^ a-nd set fi{x) = 
fx{x,w{C)) and /i*(x) = il{x,w{C)) in the above argument. The key mono- 
tonicity property (73) follows from that of w ^ fl{x,w), and the analog of 
(74) uses (61). Finally, we use the bounded shrinkage property of the pos- 
terior mean. □ 

The second lemma will be used in practice for larger values of /x. 

Lemma 6. Make the same assumptions as Lemma 5 but relax the con- 
dition that 6 is necessarily a strict thresholding rule; it is still required that 
6 has the bounded shrinkage property. Suppose that t satisfies the inequality 

(76) i < V d log n with probability 1. 

Let t be a nonrandom threshold, possibly depending on n. Then 

(77) E\fi - < 8[t« + ¥ + ! + {d\ognfl'^{P{i > t)}^/'^]. 



EMPIRICAL BAYES FOR SPARSE SEQUENCES 37 

Proof. To prove Lemma 6, suppose first that t < t. From (70) with 
A = {t <t}, we have 

(78) E\fi- fi\H[i<t]<4{t^ + b'' + 1). 

Now use (70) again, now with A = {t > t} and note that t < \J d log n w.p.l 
on ^, so that 

(79) E\fi - >t]< 4((dlogn)''/2 + fo? + i){p(£ > t)}^/\ 

Combining the two results (78) and (79) completes the proof of Lemma 6. 
□ 

Remark 3. Lemma 6 applies in particular to the posterior mean rule 
fi = fi{x,'w{C)) with estimated pseudothreshold (. 

It also follows from (52) and (54) that the bounds in Lemmas 5 and 6 
remain valid if thresholds t are replaced by pseudothresholds C throughout. 

7. Moments of the score function. In this section we derive properties 
of the score function S{w) that will facilitate our detailed consideration of 
the behavior of w. Suppose that Z ~ A^(0, 1) and define mi (//, w) = E (3{Z + 
fj,,w) and m2{iJ,,w) = E (3{Z + fi,w)'^ . We first note that 

of^ Jo 

For k = l, this shows that /i — > mi{fi,w) is increasing for > 0. 

7.1. The moments m{w) and mk{fi,w) as functions of w. We give a 
special name to the mean zero case and study it first: 

POO 

(80) m{w) :=-mi{0,w) = -2 (5{z,w)<i){z) dz. 

Jo 

Lemma 7. The function w — > m{w) is nonnegative and increasing in 
w S [0,1] and satisfies m(0) =0. If C = I3~^{w~^) is the pseudothreshold 
discussed in Section 5.4, then 

(81) rh{w)^C~^g{C) asw^O. 

Lemma 8. Fix fi> 0. The function w — > mi{fi,w) is decreasing in w £ 
[0,1] and satisfies mi{fi,0) > 0. In terms of ( = l3^^(w^^), for sufficiently 
small w < Wo{'y) (not depending on n) we have 

(82) mi{fi,w)>^mHC-l^), 

(83) m2{fi,w) < Cw~^mi{fi,w), Ai>l, 
while 

(84) mi{C,w) ^w~^ asw^O. 
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Proof of Lemma 7. For each z / /3~^(0), fiiz^w) is a decreasing func- 
tion of w and so m{w) is increasing. It follows that, as w\0, 

m{w) \ m(0) = - / I3{z)(l){z) dz = {^(z) - g{z)} dz = 0. 



To study the asymptotic behavior of m{w) as — > 0, use the property 
J-oo P(.y)4>{y) dy = Oto obtain 

m{w) = 2 ^ ^^J. . (P{z)dz. 

Define C, = (3~^{w~^). On the range z < we have w(3{z) < 1, so that 1 + 
P{0) < 1 + w(3{z) < 2. On the other hand, for z > C we have wP{z) < 1 + 
wP{z) < 2wP{z). It follows that 

(85) m{w)>i wp{zf(f>{z)dz+ P{z)(f>{z) dz. 



10 Jc 
Appealing to (46), (58) and (59), we then have for ( > 4A, 

/3(tx)20(n) du<C g{uf/<t^{u) du < 8CC'g{Cf/HC) 



(86) 1 1 1 

using (47) and assuming also that C ^ Co- Hence, the first integral in (85) is 
bounded by a term of order C^^g{C)- 

Because (3{u)(j){u) ~ g{u) as u — > oo, the second integral in (85) is asymp- 
totic, via (30) to 

/■oo 

K-1, 



giu)du^C-'giC). 

This term strictly dominates the bound C~^5'(C) and, therefore, we can con- 
clude that m{w) is bounded above and below by multiples of C'^~^g{C)- D 

Proof of Lemma 8. Note first that the expression 

/oo 
P{t, w)(j){t — fi) dt 
-oo 

shows that mi{fi,w) increases monotonically as w — > 0. The limiting value 
is 

mi(M,0)= J^^p{t)(l^{t-fi)dt = J"^(^^-iy{t-fi)dt 
= exp (^fit-^fx^^g{t)dt-l = e->''/^Mg{fi)-l, 
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where Mg denotes the moment generating function of g. Since g is the con- 
volution of 7 and (/>, and 7 is symmetric, 

POO 

= 2 1 cosh{fit)j{t) dt > 1, 
Jo 

so that mi(/i,0) > 0. [If g has sufficiently heavy tails, then mi(/i,0) may be 
infinite.] 

For sufficiently small w, we have 
(87) f3{t,w)(l){t- n)dt>0, 



since the limiting value of this expression is mi(/i,0) > 0. It follows that 

roo Bit) 1 
mi{fi,w)>J^ ^^^^^^^^ (A(t-/i)cit>-u;"^y^ <Pit-f,)dt. 

We turn to the bound on m2{fi,w). Notice first that 

(88) |«.,«,)|<('^ = l«»'l/(' + «»»- '!«=^)<»- 

[w , if p{x) > 0. 

Hence, 

E\P{ii + Z,w)\ = mi{fi,w) + E{\(3{fi + Z,w)\ -p{ii + Z,w)} 

< mi{fi, w) + 2C < Cmi{fi, w) 

for sufficiently small w and ^ > 1, since we then have mi{fi,w) > mi{l,w) > 
C. It also follows from (88) that, again for sufficiently small \l3{t,w)\ < 
for all t, and so 

m2{fi,w) <E{P{n + Z,wf} <w~^E\P{i^i + Z,w)\ < Cw'^mi{fi,w). 
Turning finally to the proof of (84), we have 

'mi{C,w)= [ 1^^'^^^] 4>{z)dz = w"^ [ r{C,z)(l){z)dz, 
J l + wp{z + C) J 

where 

as C — > 00, since letting Oi(Az) denote a quantity bounded in absolute value 
by \Az\, 

m 5(c) 4>{c+z) 



exp{Oi(Az) -Cz- z^/2] | 



0, z>0, 
00, z < 0. 
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The conclusion (84) follows from the dominated convergence theorem, since 
\r{C,z)\ < 1 [at least for C large enough that /3(C) > 2|/3(0)|]. □ 

7.2. The moments mi;{fi,'w) as functions of fi. We shall need a series of 
bounds for mk{fi,w), each successively more refined as fi is constrained to 
be closer to zero. 



Lemma 9. There are constants Ci such that for all w, defining c as 
in (88), 

r -m{w) + ClC(^y)/i^ for |/i| < 1/C{w), 

(89) mi(M,u;) < <^ C2HC/2)w-\ for < C{w)/2, 

{ {w A c)~^, for all fj, 

and 

{CsC{w)~'^'w~^Th{w), for \fi\ < 1/C{w), 

C4C-V_(C/2)w;-^ for 1^1 < Ciw)/2, 

{w A c)~2, for all fi. 

Proof. We first remark that the global bounds mi^{^,w) < {w f\ c)"'^ 
follow trivially from (88). We derive a bound on the behavior of mi{fj,, w) — 
mi{Q,w) for small /i 7^ 0. Assume that < and that ( > 2. Then for 
allyGhCC], 

(91) (/>(y - fi) = (t){y) exp {fiy - ^fi^) < e(l){y) 
and 

(92) \(p"{y -^)\ = \[y-^f- l\^{y -f,)< c(l + /)0(y), 

where the absolute constant c' < 1.25e. Using the property that |/3(z,i(;)| is 
bounded above by min{u;-i, /3(z)} if P{z) > and by {1 + /3(0)}-i|/3(z)| if 
/3{z) <0, it follows that 

(93) ^W^'^) < J^J(3iz,w)c|>"iz-^,)\dz 

(94) <C \P{z)\{l + z^)(l){z) dz + 2w~^ I 4l'{z - /i) dz. 

J-C J\z\>( 

Since g{z) < C(l + z^)"^ for all z, it follows that 

\Pizmz)<\giz)-cPiz)\<C{l + z^r' 

and hence that 

(95) J^^\(3{z)\{l + z^)cp{z)dz<CC. 
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For < ("^ and (" > 2 we have 



/ (p"{z-fj.)dz 

J\z\>c 

<2w-^ / (p"{z-\fi\)dz 



(96) 



<ccmm<ccii+c^r'- 



Combining (95) and (96), recalling that C can be a different constant in 
different expressions, we can conclude that, for < and C > 2, 

(9^mi(^,w) 

Since by symmetry dmi{fj,,w)/dfi = when /i = 0, it follows that, for /i < 

mi{fi,w) — mi{0,w) < CC/i^, 



which completes the proof of 

Turn now to the second moment. Suppose throughout that < and 
C >2 and, without loss of generality, that /i > 0. By the bounds on |/3(z)/{l + 
w(3{z)}\ and on (j){z — /.i) as above, 

m2(/x, w)<C I3{zf(^{z -ii)dz+ [ P{Cf(p{z - //) dz 



(97) < C Pizf<P{z) dz + 2/3(C)'$(C - fi) 



(98) < C7t«-iC"'5(C) + C(3{Cf<PiC - m)/(C - /^) 

by using (86). To deal with the second term in (98), use the bounds on 
0(C ~ f^) and the property that ( — IJ,> C/2 to conclude that 

It follows that, for \fi\ < and C > 2, 

Now use the property (81) that g{C) < CC^~'^m{w) to complete the proof 
of (90). 

We now turn to the proof of the intermediate bounds. Note first that 



mk{fi,w) < 2 



poo 


\ m 1 


Jo 


l + wl3{x). 



k 



{x — /i) dx. 
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On [0, C] we have 1 + w(3{x) > 1 + (3{0) > 0, so that [1 + w(3]-'^ < C, while 
on [C,oo] clearly wf3/{l + w(3) < 1. Hence 

mk{n,w)<cj (3^ {x)(p{x - fi) dx + 2w~^ J 4>{x-n)dx 

= CIk,c + 2w-''l'k,c- 
Since /3(C) = , we have for < (1 — a)C-, 

ll^ = ^C-fi)<<P{aC)/aC. 

Turning now to Ik^c, ^ C io / 4')^ {x)4'{x — fJ.)dx, we apply (58) and (59): 
since {g/(j)){C) <'^w-^ and (/)(C - /i) < exp{-C(C - 2/i)/4}0(C/2) for < ^ < 
C/2, we have 

4,C < 2'=«;~V(C/2)//fc(C; A,/i)exp{-C(C - 2/x)/4}. 
The desired conclusion for k = 2 follows. For k = 1 one may check that 
sup /?i(C; A,^)exp{-C(C - 2/i)/4} < C(A), 

C>8A,0<At<C/2 

while a direct argument shows that for ( <8A, regardless of /i, /i ^ < Ctu"^ < 

8. The marginal maximum likelihood weight and its risk properties. The 

marginal maximum likelihood method yields a random weight w, dependent 
on all the data Xi, . . . ,Xn, and, hence, to a random threshold and pseu- 
dothreshold. In this section we study the properties of w in order to use 
the risk bounds of Section 6.2 to bound the risk for the whole procedure 
and, hence, complete the proof of Theorem 1. The structure of the proof is 
essentially the same for both nearly black and ip sparseness classes, and to 
avoid unnecessary repetition of arguments, it is helpful to define 

The bounds obtained in Theorems 1 and 2 do not change as p increases 
above 2. Furthermore, for p > 2 we have £p[i]] C and so demonstrating 
the bounds for p = 2 will imply that they hold for all larger p. Therefore, for 
the whole of the subsequent argument, we assume without loss of generality 
that p<2. 

The strategy is to consider separately the components of the risk for large 
and small /ij. We employ risk decompositions 

Rq{fi,fi) = n~^ E\jli - fiil'^ + n^'^ ^ E\fii - 

(100) 

= Rg{T)+Rg{T), 
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say. For "global" risk bounds, we take r = 1, while for risk bounds over ip[r]], 
we take r roughly of order (21ogr/~^)^/^. 

In each case properties of the estimated weight and corresponding pseu- 
dothreshold are derived; these are then substituted into the appropriate 
expression for the risk. We begin by the consideration of the threshold and 
risk for the components with small /.fj. 

8.1. Small signals: lower bounds for thresholds. Suppose that, for some 
p with <p<2 and for some 77 > 0, /x lies in an £p ball: 

or, if p = 0, that the proportion of fii that are nonzero is at most rj. Let Zi 
be independent iV(0, 1) random variables, and let w be the weight estimated 
from the data /Uj + Zi by the marginal maximum likelihood procedure. Define 
the pseudothreshold C = I3~^{'w~^). 

One cannot hope to adapt to signals that are too small relative to the 
sample size n; this corresponds to restricting t{'w) to the range [0,^/2]ogn]. 
Hence, we set 

fj^ = max {r7^,n~^ (log n)^} 

and, with the usual definition C = P~^{w~^), define the weight w = w{r],n) 
by 

(101) C^wrhiw) = ff. 

Writing the left-hand side as the product of C^~'^//5(C) and rfi{w), both of 
which are increasing in w (for w sufficiently small), shows that w is well 
defined and monotonically increasing in f/, at least for fj small. 

The intent of this definition is to choose a weight w = 'w{ri, n) and pseudo 
threshold C, = C,{ri.,n) which is both a lower bound to = C{w) for ^ S ^p[if\ 
with high probability (Lemma 10) and is of the right size to yield minimax 
risk bounds [see (103), (104) and Section 8.2]. 

Some properties oiw and C,- Using the definition of (3 and the property 
(81) that mlw)^C~^g{0, 

(102) v''^c-^wC"^9{o^c-\9{o/m)]^e~^m- 

We immediately obtain a bound. 



(103) 
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Taking logarithms, 

|iogr^-iC' + (;'-i)iogCI<c. 

Hence, as r/ — > 0, 

(104) C^~21og^-P. 
More explicitly, there exist constants c such that 

(105) ^2 > I 21og?r^ + (p- l)loglog7?"^' -c, if r/P > n^^log^n, 

[ 2 log n — {5 — p) log log n — c, if rj^ < n^^ log^ n. 

Approximation (104) shows that our pseudothreshold bound C,{rj,n) has 
the order of the minimax threshold for ^p[r/n], and the right-hand side 
of (103) is essentially the asymptotic expression for the normalized minimax 
risk. We now show that C{r],n) is typically a lower bound for the estimated 
pseudothreshold when the signal is small. 

Lemma 10. Let the pseudothreshold C = C(^)''^) corresponding to fj he 
defined by (101). There exist C = C["f) and rjQ = ?/o(7) such that if r] < 7]q 

and nj log^ n>r]Q^, then 

(106) sup P^(C<C)<exp{-C(logn)3/2}. 

It follows from this lemma that if is very sparse {rj^ < log^ n), then 
t and C are, in relative terms, close to \/2 logn. On the other hand, if is 
less sparse, then i and C. are at least about V 2log rj~^ . (Recall from (53) 
that the difference C — £ G [0, C/t] is small.) 

Proof of Lemma 10. This argument leading to (103) also shows that 

~ (^^ ~ 1) log C < log - 2 log log n + 0(1), 

and hence that t{w) < Ciw) < \/2 logn for n sufficiently large, so that w G 
[t(;n,l], the interval over which the likelihood i{w) is maximized. Conse- 
quently, {C < C} = > w} = {S{w) > 0}. The summands in S{w) = J27=i f^if^i + 
Zi,w) are independent, and in view of (88), bounded by cqw^^. 

We therefore recall Bernstein's inequality [e.g., Pollard (1984), page 193], 
which gives exponential bounds on the tail probabilities of the sum of uni- 
formly bounded independent random variables. 

Proposition 2. Suppose that Wi, W2, ■ ■ ■ , Wn are independent random 
variables with EWi = and \Wi\ < M for i = 1, . . . ,n. Suppose that V > 
^27=1 var Wj. Then, for any ^ > 0, 

p(Yl^i>^ <exp{-iAV(y + iMA)}. 
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We have 

(107) P{w>w)=P{S{w)>()} = p(^YlWi> A^, 
where Wi = l3{fii + Zi,w) — mi(;Uj, u;), M = 2(c A l)~^w~^ and 

n 

A = ^-mi{fii,w). 

i=l 

Define sets of "smah," "medium" and "large" co-ordinates, 

S = {i:\fii\<C'}, M = {i:C^ <\^l^\<C/2}, C = {i:\fii\>C/2}. 

For the nearly black case, it suffices to consider only two classes, coalescing 
M and S, but it is notationally simpler to use the same argument as for ip. 
Using the three parts of (89), 

Y.miif,„w) < J2[-miw) + CiCf,]] + C\M\<PiC/2)w~^ + \C\w~\ 

On the £p-ball £p[r]], we have #{i : > t} < nrft~^ and so 

(108) |cS| > n - uTfC, \M\ < mfC, \C\ < n^l^C"^. 

On the set 5, we have /x? < l/ijl^C^"^, and so, on making use of (101), 
^mi(^j,i(;) < —nfh{w) + Cnri'^(~'^w~^ 

X [wC'^Pm{w) + CiwC^P-^ + C^Pq){C/2) + 1] 
< -n[Th{w) - CffC-Pw-^] 
= —nrh{w)[l — CC~'^] < —^nrh{w) 

for w <wq. Consequently, A > ^nrh{w). 

We now obtain a bound onV = J2 ■ Using the same decomposition 

into small, medium and large coordinates, we have from the three parts of 
(90), 

V <^m2{fii,w) < C3\S\Cw-^m{w) + C\M\w-^C^(l){C/^) + \C\w-^. 
Using now (108) along with (101), we find that for sufficiently small w, 

V < CnCw~^m{w) + CnrfC~^w-^^{C/2) + CnrfC^vj-'^ 
< Cnw-^m{w)[C^ + C^^"''"V(C/2) + C"1 
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Turning to the exponent in the Bernstein bound, we have for w <wq, 



y + (i/3)M^ 



\ < h 



2>A ~ ii?fh{w)'^ nrh{w) 
= C{nwrh{w)}~^ . 
Therefore, applying the definition of fj, 

V + il/3)MA - ^""^'^(^) ^ CnffC" > C{lognf[logr'^]^^~P^/\ 

Define no(7) so that n > no if and only if n/log^ n > rj^^. If k> p, then for 
rj <rjo and n>no we have t)"^ = mm{r]~P,n/log^ n} > rj^ ^, while if k < p, 
then fj~P < n and > —1/2. In either case we have 

[log?]"^]^""*'^^^ > C(logn)~i/2_ 

Applying the Bernstein inequality to (107) concludes the proof of (106), so 
long as w < Wo (7)- Use (101) to define 770 as the value of fj corresponding to 
wq^ and then set r/o = % to arrive at the first statement of Lemma 10. □ 

8.2. Small signals: risk behavior. We apply risk bound (71) of Lemma 
5. Bound (54) permits the inequality to be rewritten in terms of the pseu- 
dothreshold C- We have, for all values of /ij, 

(109) E\fL, - < C7{i^,r + c~'m + (1 + c)p{c < 0'/'}. 

If fj is sufficiently small [less than rjQ = 7?o(7)i say], then we may use bounds 
(104) and (103) along with the probability bound (106) to yield 

E\fli-^i^\'' < ^{1/^,1" + ,7?(logr?-?)(''-P)/2 + (log77-P)^/'e-^i°s'^'"}. 

If n > no(7), we have exp{— C(logn)^/^} < n~-'^(logn)^~^/^ < fjP {log fi~P)~P/'^ , 
and we finally obtain 

E\^i^ - < ciifiii" + friiogrn^'-p^^^}. 

If fjP = n~^(logn)^, then logry"^ = logn — 2 log log n ~ logn so that, in 
general, 

r/P(logr/-P)('?-P)/2 < max{/(log^?-P)(''"P)/^C7n-l(logn)2+(9-P)/2}. 
Combining the last two expressions and summing over i yields 

RqiO=n'^ E E\l2,-fl,\'l 

<c|n~i E |^*r + ??^(log??~n^^~''^^V^~niogn)2+(''-P)/2l. 
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li q <p, we have by Holder's inequality 

(110) 1/^*1" ^ (^~' E l^^lT^' - 

and also ?]P(log77~P)('?~P)/^ < Cp^qrj'^ for < e~^. If q > p, we have |;Uj|'' < 
and so, using the property that if p = at most m] of the terms 
will be nonzero, 

I/^»I<C 

In every case then, for fi£ £p[r]], rj < r/o(7) and n > no (7), we have 

(111) i?,(C)<C{rp,,(7?)+n-niogn)2+(«"P)/2}. 

Before leaving the consideration of small /ij , consider the case where there 
are no constraints on fi at all. The application of the elementary risk bound 
(63), along with < 1, then yields an absolute bound on the average risk 
for small fi: 

(112) E\fH- fiil'' <n-^ ^ (l + |^f»r)<2. 
l/^»l<i l/i.»l<i 

8.3. Large signals: upper bounds for thresholds. Define -7r(r; fi) = n~^^{i : 
> t}. We will be interested in deriving upper bounds on the estimated 
pseudothreshold ( when it is known that ■S'(r; /x) > vr for appropriate choices 
of T. 

Choose wq small enough so that both (81) and (83) apply. Define 

(113) w{t, vr) = supjw < wq : 7rmi(r, w) > 2rh{w)}. 

Since mi{T,w) /m{w) — > cx) as w; — > 0, certainly w{T,7r) is well defined. On 
the pseudothreshold scale, we write Ct.tt or C{tj^) for P~^{l/w{T,ir)). 

Lemma 11. There exist C = C{'y) and ttq = 7ro(7) such that if tt < ttq, 
then for all t>\, 

(114) sup P^(C>Cr,.)<exp{-C7nC-V(Cr,J}. 

Proof. If nvr of the /ij for which > r are shrunk to itr, and all 
the other are set to zero, then the distribution of each + Zi\ will be 
stochastically reduced. Since P{y,w) is an increasing function of \y\ for each 
w, it follows that S{w) will be stochastically reduced, and so P{S{w) < 0) 
will be, if anything, increased. Thus, the maximum value of P(C > C) subject 
to the constraint that at least nvr of the exceed r will be taken when 
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exactly rnr of the are equal to r and the remainder are zero. We shall 
therefore assume that this is the case. 

We now return to the problem of bounding the probability that S{w) is 
negative, for w = w{t,tt). We have, following (107) but changing the sign. 



P{w <w) = P{S{w) <0) = p}^J2Wi> Aj, 
where, on this occasion, 

n 

Wi = mi{fXi,w) - fi{fii + Zi,w) and A = ^ mi (/Xj , -u;) . 

1=1 

Just as above, \Wi\ < 2cow~^ for all i. To obtain a bound on A, we have, 
making use of the definition (113) of w, 

A = {1 — 7r)mi(0, w) + 7rmi(T, w) 

> —^7rmi{T,w) +7rmi(r, tt;) = ^7rmi{T,w). 

We now seek an upper bound on the sum of the variances of the Wi. 
Making use of the bound (90) for 7712(0, lu), bound (83) for m2{T,w) and 
(113), 



n 

n ^y^varWi < ?n2(0, w) + Trm2{T,w 

< C'w~^TTmi{T,w). 

Substituting into the expression needed for the application of Bernstein's 
inequality, we have 



n 



so that 



^2 

> CnwKmi{T, w) 



V+{1/3)MA 

> Cnwm{w) > CnC~^P{C)-^g{C) 

(since w <wq). □ 
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8.4. Large signals: risk behavior. Let C = C(r, 7r(T; /i)), where r remains 
unspecified for the moment. For each ^j, we have from (77) and (52), 

E\fii - < C{1 + C + (logn)5/2p(C > C)'/'}. 

We then consider two cases. If > logn, then the right-hand side is bounded 
by C(l + 2C'^). On the other hand, if < logn, then 

nC-^exp{-iC^} > C7iexp{-|C^} > Cn^'^, 

so that from (114), 

(logn)«/2p(<f ><^)l/2<log 

nexp 

if n > riQ. It follows that, for sufficiently small tt and n> n^, whether or not 
> logn, 

(115) E\fLi-^i\'i<C{l + C). 
Hence, 

(116) R,{T)=n-^ E\fi,- <C^TUi)[l + C{TMr,M- 

For the global risk bound needed for Theorem 1, we set r = 1. Let vr = 
7f(l;//). We seek a bound for C = Ci^.T^)- Since m{w) x C^'^giO by (81), it 
follows that for sufficiently small vr and, hence, w., 

2m[w) 

Taking logarithms, we have 

log vr"^ > c — K log C + C 
and, hence, for sufficiently small vr, 

(117) C'' = C(l,^)^<2^(log7r-i)''. 

In combination with (116), this yields, regardless of the value of vr = 7f(l; /i), 

(118) i?g(l)<C7r[l + (log7r-i)2]<C. 

Write Ci for the pseudothreshold C(^)''^) defined by (101). Our main goal 
now is to establish a large signal complement to inequality (111), namely, 

(119) R,{Ci) < C{rp,,(r/) +n"i(logn)2+(''-P)/2}. 

The approach will be to apply Lemma 11 with t = ^2 = C2(/^*) defined by 

(120) C2 = C(Ci,vr), 7r = ^^(Ci;;u). 
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Let us first verify tliat, as one would expect for fi G ipii]], C2 > Ci- Since 
rh{w) X ^^"^^(C) by (81), and mi(Ci,u^i) ~ {2wi)"^ by (84), we have, us- 
ing (102), 



mi{Ci,wi) _ l/3(Ci)^i_. a 



m{w,) "2<7(Ci)^' "0(Ci)"^ 

For p > we now use the bound ^ IfJ-il^ < nrjP, while for p = we simply 
use TT <rj. Both cases are encompassed by the inequality 

(121) 7r<rf(-^<ffCi', 
and so 

'mi{Ci,wi) / ^ 



m{wi) 



< CCr'^TT-^ < 2^-^ for Ci large. 



which shows that C2 > Ci (and, in particular, that C2 > 1)- 
In this notation the bound (116) becomes 

R,iCi)<C7r{l + CD <C7rCl 

Although (121) places an upper bound on vr, in fact, it may be arbitrarily 
much smaller. The analysis to follow considers separately cases in which tt 
is comparable to, or much smaller than, if Ci^ ■ 

Recalling the lower bound (82) that mi(Ci,'w) > \j3{C)^{C, — C,i), it follows 
that C2 < Cs = C(^3)i where is the solution to 

(122) ^{C{w)-Ci)=A'K-^wm{w). 

(^3 is intended as a more manageable version of ^2- 
Suppose first that Cs > Ci + 1- Then from (122), 

'-<! <^ r^/-g+K-lg(C3) C3 - Cl 



m) m - C: 



Using (103) and the fact that ^3 — > ^^"'"'^e"^''^"''^^''^ is decreasing, at least for 
Cs > Cl + 1, we get 

vrCl < C7?Pc!~^(Ci + l)"+''e-^^ 
From (104), we then conclude that for Ci sufficiently large, 

ttCI < max{c/(21og7?-P)-\i"+^+«-Pe-'^\Cn-icr^+''"''e"^n 
< Cmax{/(21ogr7-P)^^n~^}. 
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Now suppose that Cs ^ Ci + 1 (and, hence, C2 £ [CiXi + !])• Since $(C3 — 
Ci) ^ it follows that (^3 is smaller than the solution to 

.1(1) = iwmiw)7r~' X r"'||y^~' >^ ^-^(Ovr-^ 

Taking logarithms, the equation becomes 

^2/2 _ _ 1) log C + log c = log 7r-\ 

from which it follows that 

Cl < 2 log 7r~^ + log log TT-^ + C. 

Consequently, since vr [21og7r-^ + loglogvr^i + C]'?/^ is increasing in vr for 
sufficiently small vr, and vr < rfCi^^ we get 

^Cl < CrfCr''[21og(7?-^Ci)]'^^ < Crf (21og?r^)'^^(21og7?"^)~^/l 

li q < p, the right-hand side may be bounded further by Cr]'^. If q > p, 
consider separately the two cases r/^ > log^ n and < log^ n. In all 
cases we obtain (119) for sufficiently small r/ and n > no- 

To complete the proof of Theorem 1, combine the bounds (112) for Rq{l) 
and (118) for Rq{\). For the adaptivity bound, similarly combine bounds 
(111) for Rq{Ci) and (119) for Rq{Ci). 

9. Proof of Theorem 2. The proof of Theorem 2 requires small but sig- 
nificant modifications to the proof of Theorem 1. 

Consider first the case rf > log^ n, so that fj = r]. To show that parts 
(a) and (b) of the theorem remain true with fiA in place of ft, simply observe 
that 

E\5{X,iA) - = ^{1^(^,0 - l2\'',t<tn} + E{\6{X,tA) - f^\',i>tn}. 
Ignoring the event {t < t„} in the first term leads to 

RqiflA, ^J) < Rqifi, At) + Sq{flA, pt), 

where 

(123) 5g(A5,/i)=n-i^^{|5(Xi,tA)-/ii|'',t>tn}. 

The superscript F emphasizes the fixed threshold tA- 

The bound of Theorem 1 applies to Rq{fi,fi), so it remains to consider 
Sq{fij^,fi). Analogously to (100), decompose (123) according to terms with 
large and small values of /Xj, obtaining 

(124) Sq{fi^,fl)=Sq{T)+Sq{T), 
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where, for example, 

Sq{T) = n~^ J2 E{\6iX,,tA)-fii\'',i>tn}. 

We will need the following risk bounds from Section 6. First, from (71), 

(125) E\5{X,tA) - /ul" < cjl/zl" + t^"V(tA)}, 
while from (70), for any event B, 

(126) E{\6{X, tA) - B} < 4(t^ + 6" + l)P{By/^. 

Consider first part (a), namely, global boundedness. For small ^Uj one uses 
(125) to obtain 



I"? 



lMd<i 

(127) <c,n"^ J2 {\l^^\' + t'A~''PitA)} 

Ia'«I<i 

<Cg{l + t^^^4>{tA)}<2Cg. 

For large /ij, as in Section 8.4, introduce vr = 7r{l,n) and C(/u) defined as 
C(l,7r), where (^{tjTt) is as defined before Lemma 11. Note throughout that 
C(/^) ^ /5~^(1) > 0. Arguing as at (117), we also observe that 

C(/x)<2 1og7r-\ 

Two cases arise. If CifJ-) > log^^^ n, then 

tA = V2{l + A) logn < cC(/u) < clog7r^\ 

and so, from (126), 

E\rj{X„tA) - t^il" <c{log7r~y 

and hence 

Sg{l)<CTr{logTT~y <C. 

In the second case C{fi) < log^/^n < tn and so, using the property C = 
C{t) >i and Lemma 11, 

p(t>t„)<p(C>t„)<p{C>C(/i)} 

< exp[-CnC(/i)''"V{C(^*)}] < exp(-C7ni/^). 
Consequently, using (126) with B = {f > 

Sgil) < 4^(t^ + 6" + 1) exp(-Cni/^) < cexp(-Cn^/^) log7i < C. 
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Now turn to part (b), adaptivity over ip[r]]. The case q <p is simple; from 
(125) and bound (110), we have 

Sgifi^, /i) < + *A"V(tA)} < Cg7]i + Cn-(i+^) log'/^ n. 

i 

For q > p, we fohow a strategy broadly similar to that of Section 8.4. 
In (124) we take t = (i = ({r],n), the pseudothreshold defined by (101). 
Applying (125) in a similar manner to (127), we find that 

S,{Ci)<c,n-' E {|/^.r + iVV(tA)}<c,/cr^ + c,tVV(tA), 

which is bounded by the right-hand side of (19) in view of (104). 

To bound the large signal term ^^(Ci), we again apply Lemma 11 with 
T = (2 = C2(/^) defined as in (120). We first observe, using (126) with B = 
{t > tn}, that 

(128) S,(Ci) < c7rt^{P(t > t„)}^/2 < crfC''t\{P{i>t^)}'/\ 

Consider now three cases. First suppose that n is such that C2(/^) < in- 
Using initially the property that ^ = C(i ) > a-i^d then appealing to Lemma 
11, we have 

P{t>tn)<P{C>tn)<P{C>C2) 

< exp{-CnCrV(C2)} < exp{-C7ntrV(tn)}. 
Using the definition of t„, and the fact that t^~^ > 1 for n > 13, 

<-V(tn)><^(0)log5/2n. 

Hence, from (128) and using the fact that rj <rjQ, 

5g(Ci) < Cr7P(log?i)9/2exp(-Clog5/2n) < Crf < Crp^q{ri). 

Second, consider /x for which ("^ > logn. In this case 21ogr/~P x Ci ^ logn, 
and so, from (128) 

SgiCi) < Cr?P(logn)(''-f)/2 < Crp,,(r?). 
Finally, suppose both < logn and C2(/^) > ^n- In this case 
Cs - Ci > C2 - Ci > in - logi/2 n > \ logi/2 n 

if n > riQ. We use (122) defining to derive an upper bound on vr. The 
equation implies 

^(C3-Cl)=4^"'?^("^3)//3(C3) 

X 4^-1(3"- V(C3). 
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In other words, using (103), 

7r>^C3~\C3-ci)m)/m-ci) 

<C3exp{-(C3-Ci)Ci}0(Ci) 
>cr/Pc!-%^exp{-(C3-Ci)Ci}. 
Using the first inequahty of (128), we have 

4(Ci) < Crft\ct'<^M-iC3 - CiXi} 

< Crft\{Cz - Ci)«+^exp{-(C3 - Ci)Ci}, 

where we have used the fact that Cs ~ Ci ^ ^log^^^n > so that C,z < 
3(^3 — Ci)- Using these properties again, as well as the property that Qi > 
/3~^(1), we have 

Sg(Ci) < C77P(logn)('=+i+5)/2g^p(_^^^/j^) < fj^^ Crp^M- 

This completes the proof that the results of Theorem 1 continue to hold for 
the modified estimator for 7]^ > log^ n. 

Now turn to the case rjP < n~^log^ n. Reuse decomposition (100) with 
T = C,{ri,n) defined after (101). First use bound (71) with t = tA'- 

(129) E\fiA,i - < c^WmV + i^"V(iA) + {t\ + + ^){P{iA < tA)V^% 
By the definition of tA, we have 

(130) tW(tA) = (/'(0)n"i-^[2(l + ^)logn]('?-i)/2 < Cn-^-^ log^^-^^^ n. 

To bound P{tA < tA), observe from (53) that t^(C) > ~ C*- Iii combina- 
tion with (105), this implies, for rj^ < n^^log^n, that 

t\C)>tl + ploglogn-c-C, 

so that t^(C) > tn for n > n{p,^). Consequently, 

{tA < tA} = {i < tn} C {i < t{C)} = {C < C} 

and so we conclude from (106) that when rj^ < log^ n and n > n{p, 7), 

(131) it\ + 6" + l)PiiA < tA) < c(logn)«/2exp{-C(logn)3/2} = o(n"i-^). 



We now have 




for q <p, 
iov q > p > 0. 
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Averaging (129) over all i and inserting (130)-(132) proves (23) for the case 
q <p and (24) for q> p. 

To prove (23) for q > p, argue as in Section 8.2 to give 

(133) l^^l" <Crp,g{rj), 
so that, summing only over < 

(134) RgiC) < Crp,g{ri) + cn"!"^ log^/^ n. 

For q> p and \f-ii\ > (, we apply bound (78), noting that < with 
probability one, to obtain 

mA,i - < 4(t^ + + 1) < C(logn)«/2_ 
As in the previous section, comparing (120) and (121), we have 

7r = n-i#{i:|/x,|>C}</r^ 
and so, recalling from (105) that C > ^/\ogn, 

But r]P < log^ n implies that logry"^ > logn — 2 log log n and, hence, that 
logn < Clogr]~P, so for n large and rf <n~^ log^n we have 

combining this result with (134) completes the proof of Theorem 2. □ 

10. Remarks on the posterior mean. In proving results for the pos- 
terior mean, we have assumed throughout that q> \. The failure of the 
posterior mean to be a strict thresholding rule has a substantive effect on 
the overall risk if g < 1. Concentrate attention on the case where 7 is the 
Laplace distribution with parameter 1, and define il]{n) = exp(-^21ogn) so 
that g{^J2\ogn)~^ x ilj{n) as n — > 00. 

An important contributor to our arguments was t{w) < \/2Togn, from 
which it follows that t^w) < \/21ogn. By the definition of r, we have 

w 0(V21ogn) _i 

Tj r > . . > Cn TP{n), 

1 — w g[^/2[ogn) 

so that w > Cn~^'ip{n). Since g{x)/(j){x) is bounded below away from zero, 
it follows that, for some constant C and for all x, the posterior weight 

w{x,w) > Cn~^ilj{n). 
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On the other hand, the odd function /xi satisfies fl'i{0) > and is strictly 
increasing with jli{x) > x — A for all x; therefore, x~^fli{x) is uniformly 
bounded below away from zero for 2: 7^ 0. It follows that, for all x 7^ 0, 

\fl{x,w) \ = w{x,w)\fli{x)\ > C\x\n~^\(j{n). 

If /i = and X ~ N{0, 1), it follows that 

E\fi{X,w) - > C^-^H'^^l^r = Cn-VN" 

so that, however small the value of rj, the risk bound cannot be reduced 
below Cn~'^ip{nY , making it impossible for the estimate to attain the full 
range of adaptivity given by the posterior median. The restrictions become 
more severe the lower the value of q. 
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